Abstract

This paper studies the problem of recovering a non-negative sparse signal $x \in \mathbb{R}^n$ from highly
corrupted linear measurements $y = Ax + e \in \mathbb{R}^m$, where e is an unknown error vector whose nonzero
entries may be unbounded. Motivated by an observation from face recognition in computer vision, this
paper proves that for highly correlated (and possibly overcomplete) dictionaries A, any non-negative,
sufficiently sparse signal x can be recovered by solving an $\ell^1$-minimization problem:

$\min \|x\|_1 + \|e\|_1$ subject to $y = Ax + e$.

More precisely, if the fraction $\rho$ of errors is bounded away from one and the support of x grows sublinearly
in the dimension m of the observation, then as m goes to infinity, the above $\ell^1$-minimization succeeds
for all signals x and almost all sign-and-support patterns of e. This result suggests that accurate recovery 
of sparse signals is possible and computationally feasible even with nearly 100% of the observations 
corrupted. The proof relies on a careful characterization of the faces of a convex polytope spanned 
together by the standard crosspolytope and a set of iid Gaussian vectors with nonzero mean and small 
variance, which we call the "cross-and-bouquet" model. Simulations and experimental results corroborate 
the findings, and suggest extensions to the result. 

Index Terms 

Sparse Signal Recovery, Dense Error Correction, $\ell^1$-minimization, Gaussian Matrices, Polytope
Neighborliness.

I. Introduction 

Recovery of high-dimensional sparse signals or errors has been one of the fastest growing research areas 
in signal processing in the past few years. At least two factors have contributed to this explosive growth. 
On the theoretical side, the progress has been propelled by powerful tools and results from multiple 
mathematical areas such as measure concentration [1]-[3], statistics [4]-[6], combinatorics [7], and coding
theory [8]. On the practical side, a lot of excitement has been generated by remarkable successes in
real-world applications in areas such as signal (image or speech) processing [9], communications [10],
and computer vision and pattern recognition [11]-[13].

A. A Motivating Example 

One notable, and somewhat surprising, successful application of sparse representation is automatic face 
recognition. As described in [11], face recognition can be cast as a sparse representation problem. For 
each person, a set of training images is taken under different illuminations. We can view each image
as a vector by stacking its columns and put all the training images as column vectors of a matrix, say
$A \in \mathbb{R}^{m \times n}$. Then, m is the number of pixels in an image and n is the total number of images for all
the subjects of interest. Given a new query image, again we can stack it as a vector $y \in \mathbb{R}^m$. To identify
which subject the image belongs to, we can try to represent y as a linear combination of all the images,
i.e., y = Ax for some $x \in \mathbb{R}^n$. Since in practice n can potentially be larger than m, the equations
can be underdetermined and the solution x may not be unique. In this context, it is natural to seek the
sparsest solution for x whose large non-zero coefficients then provide information about the subject's
true identity. This can be done by solving the typical $\ell^1$-minimization problem:

$\min_x \|x\|_1$ subject to $y = Ax$. (1)

The problem becomes more interesting if the query image y is severely occluded or corrupted, as 
shown in Figure 1 (left), column (a). In this case, one needs to solve a corrupted set of linear equations
y = Ax + e, where $e \in \mathbb{R}^m$ is an unknown vector whose nonzero entries correspond to the corrupted
pixels. For sparse errors e and tall matrices A (m > n), Candès and Tao [14] proposed to multiply the
equation y = Ax + e with a matrix B such that BA = 0, and then use $\ell^1$-minimization to recover the
error vector e from the new linear equation By = Be.
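The annihilation step just described can be illustrated numerically. The following is a minimal sketch, assuming a generic tall Gaussian matrix and taking B to be an orthonormal basis of the left null space of A (one convenient choice of B with BA = 0, not necessarily the construction of [14]); the function name `annihilator` and all sizes are hypothetical.

```python
import numpy as np
from scipy.linalg import null_space

def annihilator(A):
    """Rows of the result form an orthonormal basis of the left null space of A, so B @ A = 0."""
    return null_space(A.T).T

rng = np.random.default_rng(0)
m, n = 60, 20                                   # tall system, m > n (illustrative sizes)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Sparse gross errors on 5 of the m measurements.
e = np.zeros(m)
e[rng.choice(m, size=5, replace=False)] = 10.0 * rng.standard_normal(5)
y = A @ x + e

B = annihilator(A)                              # shape (m - n, m) when A has full column rank
residual = np.linalg.norm(B @ y - B @ e)        # By = Be up to rounding, since BAx = 0
```

The reduced system By = Be has m - n equations in the m unknown entries of e, which is why this route needs e to be sparse and A to be tall.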

As we mentioned earlier, in face recognition (and many other applications), n can be larger than m 
and the matrix A can be full rank. One cannot directly apply the above technique even if the error e is 
known to be very sparse. To resolve this difficulty, in [11], the authors proposed to instead seek [x, e]
together as the sparsest solution to the extended equation y = [A I]w with $w = \begin{bmatrix} x \\ e \end{bmatrix} \in \mathbb{R}^{m+n}$, by solving
the extended $\ell^1$-minimization problem:

$\min_w \|w\|_1$ subject to $y = [A\ I]\,w$. (2)

This seemingly minor modification to the previous error correction approach has drastic consequences on 
the performance of robust face recognition. Solving the modified $\ell^1$-minimization enables almost perfect
recognition even when more than 60% of the pixels of the query image are arbitrarily corrupted (see Figure 1
for an example), far beyond the amount of error that can theoretically be corrected by the previous error 
correction method [14]. 

Although $\ell^1$-minimization is expected to recover sufficiently sparse solutions with overwhelming
probability for general systems of linear equations (see [16]), it is rather surprising that it works for the
equation y = [A I]w at all. In the application described above, the columns of A are highly correlated.
Fig. 1. Face recognition under random corruption. Left: (a) Test images y with random corruption from the database
presented in [15]. Top row: 30% of pixels are corrupted; middle row: 50% corrupted; bottom row: 70%
corrupted. (b) Estimated errors $\hat{e}$. (c) Estimated sparse coefficients $\hat{x}$. (d) Reconstructed images
$\hat{y} = A\hat{x}$. The extended $\ell^1$-minimization (2) correctly recovers and identifies all three corrupted face
images. Right: The recognition rate across the entire range of corruption for all the 38 subjects in the
database. It performs almost perfectly up to 60% random corruption.



As m becomes large (i.e. the resolution of the image becomes high), the convex hull spanned by all
face images of all subjects is only an extremely tiny portion of the unit sphere $S^{m-1}$.¹ For example, the
images in Figure 1 lie on $S^{8,063}$. The smallest inner product with their normalized mean is 0.723; they are
contained within a spherical cap of volume $< 1.47 \times 10^{[\ldots]}$. These vectors are tightly bundled together
as a "bouquet," whereas the vectors associated with the identity matrix and its negative ±I together²
form a standard "cross" in $\mathbb{R}^m$, as illustrated in Figure 2. Notice that such a "cross-and-bouquet" matrix
[A I] is neither incoherent nor (restrictedly) isometric, at least not uniformly. Also, the density of the 
desired solution w is not uniform either. The x part of w is usually a very sparse non-negative vector, 
but the e part can be very dense and have arbitrary signs. Existing results for recovering sparse signals 

¹ At first sight, this seems somewhat surprising as faces of different people look so different to human eyes. That is probably
because the human brain has adapted to distinguish highly correlated visual signals such as faces or voices.
² Here we allow the entries of the error e to assume either positive or negative signs.
Fig. 2. The "cross-and-bouquet" model. Left: the bouquet A and the crosspolytope spanned by the matrix ±I. Right:
the tip of the bouquet magnified; it is a collection of iid Gaussian vectors with small variance $\sigma^2$ and
common mean vector $\mu$. The cross-and-bouquet polytope is spanned by vertices from both the bouquet A
and the cross ±I.



suggest that $\ell^1$-minimization may have difficulty in dealing with such signals, contrary to its empirical
success in face recognition.

We have experimented with similar cross-and-bouquet type models where the matrix A is a random
matrix with highly correlated column vectors. The simulation results in Section III indicate that what we
have seen in face recognition is not an isolated phenomenon. In fact, the simulations reveal something even
more striking and puzzling: As the dimension m increases (and the sample size n grows in proportion),
the percentage of errors that the $\ell^1$-minimization (2) can correct seems to approach 100%! This may
seem surprising, but this paper explains why this should be expected. 

B. The Main Model and Result 

Motivated by the above empirical observations, this paper aims to resolve the apparent discrepancy 
between theory and practice of $\ell^1$-minimization and gives a more careful characterization of its behavior
in recovering [x, e] from the cross-and-bouquet (CAB) type models:

$y = Ax + e = [A\ I]\,w$. (3)

We model the bouquet, the columns of A, as iid samples from a multivariate Gaussian distribution
$\mathcal{N}(\mu, \sigma^2 I_m)$, where $\sigma = \nu m^{-1/2}$ with $\nu$ sufficiently small, $\|\mu\|_2 = 1$, and $\|\mu\|_\infty \le C_\mu m^{-1/2}$ for some
$C_\mu \in \mathbb{R}_+$. These conditions ensure that the bouquet remains tight as the dimension m grows, and that its
mean is mostly incoherent with the columns of the cross ±I.
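To make the bouquet model concrete, the sketch below draws such a matrix. It is only an illustration: the mean is assumed to be the flat vector with all entries equal to $m^{-1/2}$ (one choice satisfying both norm conditions, with $C_\mu = 1$), and the function name `sample_bouquet` and the sizes are hypothetical.

```python
import numpy as np

def sample_bouquet(m, n, nu, rng):
    """Draw A in R^{m x n} with iid columns a_i ~ N(mu, (nu^2 / m) I_m)."""
    # Assumed mean: ||mu||_2 = 1 and ||mu||_inf = m^{-1/2}.
    mu = np.ones(m) / np.sqrt(m)
    sigma = nu / np.sqrt(m)                      # per-entry standard deviation
    A = mu[:, None] + sigma * rng.standard_normal((m, n))
    return A, mu

rng = np.random.default_rng(0)
A, mu = sample_bouquet(m=400, n=160, nu=0.1, rng=rng)

# Cosine of the angle between each column and the mean direction;
# values close to 1 indicate a tight bouquet.
cosines = (A / np.linalg.norm(A, axis=0)).T @ mu
```

With $\nu = 0.1$ each column deviates from $\mu$ by a vector of norm roughly $\nu$, so the columns stay tightly clustered around the mean direction.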

We consider proportional growth for m and n, that is, $n/m \to \delta \in \mathbb{R}_+$ as $m \to \infty$. However, the
support size of the sparse signal x is only allowed to grow sublinearly in m: $\|x\|_0 = O(m^{1-\eta})$ for
some $\eta > 0$. This condition differs from (and is stronger than) the typical assumption in the sparse
representation literature, where the support is often allowed to grow proportionally with the dimension 
[16]. In the next subsection, we will explain why the support of the signal x can only be sublinear if 
we allow the support of the error e to be arbitrarily dense. Nevertheless, this sublinear bound of sparsity 
is more than adequate for signals in many practical problems, including the face recognition problem. 
There, the support of x is bounded by a constant - the number of images per subject. 

This paper proves that under the above conditions,
for any $\rho < 1$, as m goes to infinity, solving the $\ell^1$-minimization problem (2) correctly recovers
any non-negative sparse signal x from almost any error e with support size $\le \rho m$.
We leave a more precise statement and the proof of the fact to Section II. In the remainder of this section,
we discuss some of the main implications of this result in the broad context of sparse signal recovery, 
error correction, and some of its potential applications. 
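For readers who wish to experiment, the extended $\ell^1$-minimization (2) can be posed as a linear program. The following is a minimal sketch using `scipy.optimize.linprog` with the standard splitting w = u - v, u, v ≥ 0; it is not the solver used for the experiments reported here, and the function name `extended_l1`, the flat mean vector, and the problem sizes are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def extended_l1(A, y):
    """Solve min ||x||_1 + ||e||_1 subject to y = A x + e as a linear program."""
    m, n = A.shape
    M = np.hstack([A, np.eye(m)])                # the matrix [A I]
    # Split w = u - v with u, v >= 0; at an optimum sum(u) + sum(v) = ||w||_1.
    c = np.ones(2 * (n + m))
    A_eq = np.hstack([M, -M])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    w = res.x[: n + m] - res.x[n + m :]
    return w[:n], w[n:]

rng = np.random.default_rng(1)
m, n, nu = 100, 40, 0.1
# Bouquet with an assumed flat mean (all entries m^{-1/2}).
A = np.ones((m, 1)) / np.sqrt(m) + (nu / np.sqrt(m)) * rng.standard_normal((m, n))

x0 = np.zeros(n)
x0[[3, 17]] = [1.0, 0.5]                         # non-negative sparse signal
e0 = np.zeros(m)
J = rng.choice(m, size=20, replace=False)        # corrupted entries
e0[J] = rng.choice([-1.0, 1.0], size=20) * rng.uniform(0.5, 2.0, size=20)

y = A @ x0 + e0
x_hat, e_hat = extended_l1(A, y)
```

At such small m exact recovery of (x0, e0) is not guaranteed; the result stated above is asymptotic in m. What the sketch does guarantee is that the returned pair is feasible and has $\ell^1$-norm no larger than that of (x0, e0).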



C. Relations to Previous Results 

a) Restricted isometry and incoherence of the cross-and-bouquet model: As mentioned earlier, 
typical results in the literature for sparse signal recovery do not apply to equations of the type y = Ax+e. 
The cross-and-bouquet matrix [A I] is neither highly isometric nor incoherent. As a result, greedy
algorithms such as Orthogonal Matching Pursuit [17], [18] succeed only when the error e is very sparse
(see Section III a) for the simulation results and comparison with our method). However, this does not
mean that the restricted isometry property is irrelevant to the new problem. On the contrary, the proof
of our results relies precisely on characterizing a special type of restricted isometry associated with this
new problem; see Lemma 5 in Appendix A, which is used in the proof of our main result. Moreover,
unlike the typical compressed sensing setting, the solution [x, e] sought has very uneven density (or
sparsity). This is reminiscent of the block sparsity studied in [19]. However, as we will see, the special
block structure of the cross-and-bouquet model enables sparse recovery far beyond the breakdown point
for general sparse (or block sparse) signals. 
b) Error correction: From an error correction viewpoint, the above result seems surprising: One
can correctly solve a set of linear equations with almost all the equations randomly and arbitrarily
corrupted! This is especially surprising considering that the best error-correcting codes (in the binary
domain $\mathbb{Z}_2$), constructed based on expander graphs, normally correct a fixed fraction of errors [20]-[22].
The exact counterpart of our result in the binary domain is not clear.³ While there are superficial
similarities between our result and [21], [23] in the use of linear programming for decoding and analysis 
via polytope geometry, those works do not consider real valued signals. In particular, the negative result 
of [23] for specific families of binary codes admitting linear programming decoders does not apply here. 

We can, however, draw the following comparisons with existing error correction methods in the domain 
of real numbers: 

• When n < m, the range of A is a subspace in $\mathbb{R}^m$. In such an overdetermined case, one could
directly apply the method of Candès and Tao [14] mentioned earlier. However, the error vector e
needs to be sparse for that approach whereas our result suggests even dense errors (with support far
beyond 50%) can be corrected by instead solving the extended $\ell^1$-minimization (2). Thus, even in
the overdetermined case, the new method has clear advantages for coherent matrices A. This will
be verified by simulations in Section III a).



• The sublinear growth of the support of x in m is the best one can hope for in the regime of dense
errors. In general, we need at least $\|x\|_0$ uncorrupted linear measurements to recover x uniquely.
If an arbitrary fraction of the m equations can be totally corrupted by e, no fixed fraction of the
equations remain good for recovering x. If, on the other hand, the error e is sparse, then the
$\ell^1$-minimization (2) is able to recover x with linear growth in support, as suggested by the existing
theory [14], [16], [24]. Simulation results in Section III d) also confirm this phenomenon. However,
in this paper, we are mainly interested in how the $\ell^1$-minimization behaves with dense errors, for

• When n > m, in general the Gaussian matrix A is full rank and the method of Candès and Tao [14]
no longer applies.⁴ Our result suggests that as long as A is highly correlated, the $\ell^1$-minimization
(2) can still recover the sparse signal x correctly even if almost all the equations might be corrupted.
This is verified by the simulation results in Section III c).

³ It is possible that under an analogous growth model (see Section II-A), the LP decoder of [21] could also correct large
fractions of binary errors.

c) Polytope geometry: The success of $\ell^1$-minimization in recovering sparse solutions x from underdetermined
systems of linear equations y = Ax can be viewed as a consequence of a surprising property
of high-dimensional polytopes. If the column vectors of A are random samples from a zero-mean Gaussian
$\mathcal{N}(0, I)$, and m and n are allowed to grow proportionally, then with overwhelming probability the convex
polytope conv(A) spanned by the columns of A is highly neighborly [24], [25]. Neighborliness provides
the necessary and sufficient condition for uniform sparse recovery: the $\ell^1$-minimization (1) correctly
recovers x if and only if the columns associated with the nonzero entries of x span a face of the
polytope conv(A).

In our case, the columns of the matrix A are iid Gaussian vectors with nonzero mean $\mu$ and small
variance $\sigma^2$, whereas the vectors of the cross ±I are completely fixed. To characterize when the extended
$\ell^1$-minimization (2) is able to recover the solution [x, e] correctly, we need to examine the geometry
of the peculiar convex polytope conv(A, ±I) spanned together by the random bouquet A and the fixed
cross ±I. Thus, it comes as no surprise that the proof of our main result relies on a careful study of the
geometry of such a "cross-and-bouquet" polytope. Indeed, as we will show, the vertices associated
with the non-zero entries of x and e form a face of the polytope with probability approaching one as the
dimension m becomes large. Precisely due to high neighborliness of the cross-and-bouquet polytopes,
the extended $\ell^1$-minimization (2) is able to correctly recover the desired solution, even though the part
of the solution corresponding to e might be dense. 

D. Implications on Applications 

a) Robust reconstruction, classification, and source separation: The new result about the cross- 
and-bouquet model has strong implications on robust reconstruction, classification, and separation of 
highly correlated classes of signals such as faces or voices, despite severe corruption. It helps explain 
the surprising performance of face recognition that we discussed earlier. It further suggests that if the 

⁴ One could choose to pre-multiply the equation y = Ax + e with an "approximate orthogonal complement" of A, say the
orthogonal complement of the mean vector $\mu$, which is an $(m-1) \times m$ matrix B. Then the equation becomes By = Be + z
where z = BAx. If the norm of x is bounded, then z is a signal with small magnitude due to the near-orthogonality of B and
A. In this case, one can view z as a noise term and try to recover e as a sparse signal via $\ell^1$-minimization. However, for e
with arbitrary signs, the breakdown point for such $\ell^1$-minimization is less than 50%.
resolution of the image increases in proportion with the size of the database (say, due to the increasing
number of subjects), the $\ell^1$-minimization would tolerate even higher levels of corruption, far beyond the
60% at the resolution experimented with in [11]. Other applications where this kind of model could be 
useful and effective include speech recognition/imputation, audio source separation, video segmentation, 
or activity recognition from motion sensors. 

b) Communication through an almost random channel: The result suggests that we can use the cross- 
and-bouquet model to accurately send information through a highly corrupting channel. Hypothetically, we 
can imagine a channel through which we can send one real number at a time, say as one packet of binary 
bits, and each packet has a high probability of being totally corrupted. One can use the sparse vector x (or 
its support) to represent useful information, and use a set of highly correlated high-dimensional vectors 
as the encoding transformation A. The high correlation in A ensures that there is sufficient redundancy 
built in the encoded message Ax so that the information about x will not be lost even if many entries 
of Ax can be corrupted while being sent through such a channel. Our result suggests that the decoding 
can be done correctly and efficiently using linear programming. 

c) Encryption and information hiding: One can potentially use the cross-and-bouquet model for 
encryption. For instance, if both the sender and receiver share the same encoding matrix A (say a 
randomly chosen Gaussian matrix), the sender can deliberately corrupt the message Ax with arbitrary 
random errors e before sending it to the receiver. The receiver can use linear programming to decode 
the information x, whereas any eavesdropper will not be able to make much sense out of the highly 
corrupted message y = Ax + e. Of course, the long-term security of such an encryption scheme relies 
on the difficulty of learning the encoding matrix A after gathering many instances of corrupted messages.
It is not even clear whether it is easy to learn A from instances of uncorrupted messages y = Ax. Even
if the dimensions of the matrix A are given, effectively learning A from a set of observed messages
$Y = [y_1, y_2, \ldots, y_k]$ is still a largely open problem, known in the literature as the "dictionary learning"
problem. Existing algorithms are iterative or greedy in nature, with no guarantee of global optimality [9]. 
Although its hardness has not been precisely characterized, we expect dictionary learning from highly 
corrupted observations to be an even more daunting problem, a challenge for anyone who tries to break 
this encryption scheme. 

II. Roadmap of the Proof



In this section, we begin with a precise statement of our main result in Section II-A. We then lay out the
roadmap for the proof. Section II-B outlines the key geometric picture behind the proof. In Section II-C
we then prove the main result, assuming that two technical conditions in Lemma 2 hold. Section II-D
discusses the ideas required to establish these conditions, leaving a number of details to the Appendix.

A. Problem Statement

Motivated by the face recognition example introduced above, we consider the problem of recovering 
a non-negative⁵ sparse signal $x_0 \in \mathbb{R}^n$ from highly corrupted observations $y \in \mathbb{R}^m$:

$y = A x_0 + e_0$,

where $e_0 \in \mathbb{R}^m$ is a sparse vector of errors of arbitrary magnitude. The model for $A \in \mathbb{R}^{m \times n}$ should
capture the idea that it consists of small deviations about a mean, hence a "bouquet." In this paper, we
consider the case where the columns of A are iid samples from a Gaussian distribution:

$A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n}$, $a_i \sim_{iid} \mathcal{N}\left(\mu, \frac{\nu^2}{m} I_m\right)$, $\|\mu\|_2 = 1$, $\|\mu\|_\infty \le C_\mu m^{-1/2}$. (4)

Together, the two assumptions on the mean force it to remain incoherent with the standard basis (or 
"cross") as the dimension increases. 

We study the behavior of the solution to the $\ell^1$-minimization (2) in this model, in the following
asymptotic framework, which we term "weak proportional growth":

Assumption 1 (Weak Proportional Growth): A sequence of signal-error problems exhibits weak proportional
growth with parameters $\delta > 0$, $\rho \in (0, 1)$, $C_0 > 0$, $\eta_0 > 0$, denoted $\mathrm{WPG}_{\delta,\rho,C_0,\eta_0}$, if as $m \to \infty$,

$\frac{n}{m} \to \delta$, $\frac{\|e_0\|_0}{m} \to \rho$, $\|x_0\|_0 \le C_0 m^{1-\eta_0}$. (5)

This should be contrasted with the "total proportional growth" (TPG) setting of, e.g., [26], in which the
number of nonzero entries in the signal $x_0$ also grows as a fixed fraction of the dimension. In that setting,
one might expect a sharp phase transition in the combined sparsity of $(x_0, e_0)$ that can be recovered
by $\ell^1$-minimization.⁶ In WPG, on the other hand, we observe a striking phenomenon not seen in TPG:
the correction of arbitrary fractions of errors. This comes at the expense of the stronger assumption that
$\|x_0\|_0 = o(m)$, an assumption that is valid in some real applications such as the face recognition example
above. 
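As a small illustration of the scaling in (5), the sketch below tabulates concrete problem sizes along one sequence consistent with weak proportional growth. The rounding conventions and the parameter values are assumptions made for illustration; (5) itself only constrains the limits.

```python
import math

def wpg_sizes(m, delta, rho, C0, eta0):
    """Return (n, k1, k2) for dimension m under illustrative WPG-style scaling."""
    n = round(delta * m)                              # n / m -> delta
    k2 = math.floor(rho * m)                          # ||e0||_0 / m -> rho
    k1 = max(1, math.floor(C0 * m ** (1.0 - eta0)))   # ||x0||_0 <= C0 * m^(1 - eta0)
    return n, k1, k2

for m in (10**3, 10**4, 10**5, 10**6):
    n, k1, k2 = wpg_sizes(m, delta=0.5, rho=0.75, C0=1.0, eta0=0.5)
    # The error fraction k2 / m stays fixed while the signal fraction k1 / m shrinks.
    print(m, n, k1, k2, k1 / m)
```

The printed ratio k1 / m decreases with m, whereas k2 / m remains at $\rho$; this is the asymmetry between signal and error supports that distinguishes WPG from TPG.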

⁵ The non-negativity assumption is important: in the highly-coherent systems considered here, $\ell^1$-minimization generally does
not recover signals $x_0$ with arbitrary signs. Geometrically, this would require vectors from the "bouquet" to "see" through
the crosspolytope to vectors that are nearly antipodal to them.

⁶ Existing results (e.g., [24]) do not prove the existence of phase transitions in inhomogeneous models such as the one
considered here. However, simulations suggest that in total proportional growth, such transitions do occur (see Section III d)).
Before stating our main result, we fix some additional notation. For any $n \in \mathbb{Z}_+$, [n] denotes the set
{1, ..., n}. Let $I = \mathrm{supp}(x_0) \subset [n]$, $J = \mathrm{supp}(e_0) \subset [m]$, $\sigma = \mathrm{sgn}(e_0(J))$, and let $k_1 = |I|$ be the
support size of the signal $x_0$ and $k_2 = |J|$ the support size of the error $e_0$. For an arbitrary $r_1 \times r_2$
matrix M, if $L_1 \subset [r_1]$ and $L_2 \subset [r_2]$, $M_{L_1,L_2}$ denotes the $|L_1| \times |L_2|$ submatrix of M indexed by these
quantities. We use $M_{L_1,\bullet}$ as a shorthand for $M_{L_1,[r_2]}$. $M^*$ denotes the transpose of M. Also, we use $1_I$
(or $1_J$) to represent a vector in $\mathbb{R}^n$ (or $\mathbb{R}^m$) that has ones on the support I (or J) and zeros elsewhere.
To reduce confusion between the index set I and the identity matrix, we use $\mathbf{I}$ to denote the latter.
Below, where the symbol C occurs with no subscript, it should be read as "some constant." When used
in different sections, it need not refer to the same constant.

In the following, we say the cross-and-bouquet model is $\ell^1$-recoverable at $(I, J, \sigma)$ if for all $x_0 \ge 0$
with support I and $e_0$ with support J and signs $\sigma$, we have

$(x_0, e_0) = \arg\min \|x\|_1 + \|e\|_1$ subject to $Ax + e = Ax_0 + e_0$, (6)

and the minimizer is uniquely defined. From the geometry of $\ell^1$-minimization, if (6) does not hold for
some pair $(x_0, e_0)$, then it does not hold for any (x, e) with the same signs and support as $(x_0, e_0)$ [25].
Understanding $\ell^1$-recoverability at each $(I, J, \sigma)$ completely characterizes which solutions to y = Ax + e
can be correctly recovered. In this language, our main result can be stated more precisely as:

Theorem 1 (Error Correction with the Cross-and-Bouquet Model): For any $\delta > 0$, $\exists\, \nu_0(\delta) > 0$ such
that if $\nu < \nu_0$ and $\rho < 1$, in $\mathrm{WPG}_{\delta,\rho,C_0,\eta_0}$ with A distributed according to (4), error support J chosen
uniformly at random from $\binom{[m]}{k_2}$ and error signs $\sigma$ chosen uniformly at random from $\{\pm 1\}^{k_2}$,

$\lim_{m \to \infty} P_{A,J,\sigma}\left[\, \ell^1\text{-recoverability at } (I, J, \sigma) \ \ \forall\, I \in \binom{[n]}{k_1} \,\right] = 1$. (7)

In other words, as long as the bouquet is sufficiently tight, asymptotically $\ell^1$-minimization recovers any
non-negative sparse signal from almost any error with support size less than 100%.

B. Problem Geometry 

We first restate the necessary and sufficient conditions for $\ell^1$-recoverability geometrically, as separation
of a higher-dimensional $\ell^1$-ball and an affine subspace (see Figure 3). To witness this separation, we must
show the existence of a separating hyperplane, whose normal we will denote by q.

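Lemma 1: Fix $(I, J, \sigma)$, and define $w = A_{J,\bullet}^* \sigma - 1_I \in \mathbb{R}^n$ and

$G = \begin{bmatrix} A_{J^c,\bullet} \\ \mathbf{I}_{I^c,\bullet} \end{bmatrix} \in \mathbb{R}^{p \times n}$, $p = m + n - k_1 - k_2$, (8)

where the lower block consists of the $n - k_1$ rows of the identity indexed by $I^c$. Suppose G has full column rank n.⁷ The model is $\ell^1$-recoverable at $(I, J, \sigma)$ iff

$\exists\, q \in \mathbb{R}^p$ such that $\|q\|_\infty < 1$ and $G^* q = w$. (9)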
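Proof: As above, let $y = Ax_0 + e_0$. Writing (x, e) for $(x_0, e_0)$, the pair is the unique minimum $\ell^1$-norm solution to
the equation y = Ax + e iff

$\nexists\, (\Delta x, \Delta e) \ne 0 : A\Delta x = -\Delta e,\ \|x + \Delta x\|_1 + \|e + \Delta e\|_1 \le \|x\|_1 + \|e\|_1$. (10)

Due to the geometry of $\ell^1$-minimization and the convexity of $\|\cdot\|_1$, we lose no generality in assuming
that $x = 1_I$, $e \in \{-1, 0, 1\}^m$ and $\|\Delta x\|_\infty < 1$, $\|\Delta e\|_\infty < 1$. Then,

$\|x + \Delta x\|_1 = \|x\|_1 + 1_I^* \Delta x + \|\Delta x_{I^c}\|_1$, and $\|e + \Delta e\|_1 = \|e\|_1 + e^* \Delta e + \|\Delta e_{J^c}\|_1$.

Substituting into (10) and using $\Delta e = -A\Delta x$ yields that (x, e) is optimal iff

$\nexists\, \Delta x \ne 0 : \|A_{J^c,\bullet} \Delta x\|_1 + \|\Delta x_{I^c}\|_1 \le \langle A^* e - 1_I, \Delta x \rangle$.

Since $A^* e = A_{J,\bullet}^* \sigma$, the right-hand side equals $\langle w, \Delta x \rangle$ and the left-hand side equals $\|G \Delta x\|_1$, so this condition is satisfied iff

$\forall\, \Delta x \ne 0,\ \|G \Delta x\|_1 > \langle w, \Delta x \rangle$. (11)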

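Let $H_w \subset \mathbb{R}^n$ be the affine subspace $\{x \mid \langle w, x \rangle = 1\}$. The function $\|G\,\cdot\|_1$ defines a norm
$\|\cdot\|_\diamond$ on $\mathbb{R}^n$. Geometrically, (11) is satisfied iff the unit ball $B_\diamond$ of $\|\cdot\|_\diamond$ is contained in the half space
$H_w^- = \{x \mid \langle w, x \rangle < 1\}$, as illustrated in Figure 3. This unit ball is a convex polytope, given by the
inverse image (under the injective map G) of the intersection of $\mathcal{R}(G)$ and the unit $\ell^1$-ball $B_1$ in $\mathbb{R}^p$:

$B_\diamond = G^{-1}\left[\mathcal{R}(G) \cap B_1(\mathbb{R}^p)\right]$. (12)

Now, $B_\diamond \subset H_w^-$ iff $[\mathcal{R}(G) \cap B_1(\mathbb{R}^p)] \subset G[H_w^-]$ iff $B_1(\mathbb{R}^p) \cap G[\mathrm{cl}\, H_w^+] = \emptyset$. These two closed convex
sets are nonintersecting iff there is a hyperplane⁸ $H_q = \{v \in \mathbb{R}^p \mid \langle q, v \rangle = 1\} \subset \mathbb{R}^p$ separating them
(see Figure 3 again). We lose no generality in assuming that $B_1 \subset H_q^-$, that $G[\mathrm{cl}\, H_w^+] \subset \mathrm{cl}\, H_q^+$, and that
$H_q$ meets the relative boundary $\mathrm{rbd}\, G[\mathrm{cl}\, H_w^+] = G[H_w]$. The first condition occurs iff $\|q\|_\infty < 1$, while
the second occurs iff $G^* q = w$. ■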

The most natural candidate for a normal vector q is the minimum $\ell^2$-norm solution to this equation,

$q_0 = (G^\dagger)^* w = G (G^* G)^{-1} w$. (13)
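The quantities w, G and $q_0$ are straightforward to form numerically. The sketch below is one possible construction under stated assumptions: a bouquet with flat mean vector, an arbitrary small signal support, and a random error support and sign pattern; the function name `separator_candidate` and all sizes are hypothetical.

```python
import numpy as np

def separator_candidate(A, I, J, sigma):
    """Build w and G as in Lemma 1, and the minimum l2-norm solution q0 of G^T q = w."""
    m, n = A.shape
    Ic = np.setdiff1d(np.arange(n), I)           # complement of the signal support
    Jc = np.setdiff1d(np.arange(m), J)           # complement of the error support
    ones_I = np.zeros(n)
    ones_I[I] = 1.0
    w = A[J, :].T @ sigma - ones_I               # w = A_{J,.}^* sigma - 1_I
    G = np.vstack([A[Jc, :], np.eye(n)[Ic, :]])  # p x n, p = m + n - k1 - k2
    q0 = G @ np.linalg.solve(G.T @ G, w)         # q0 = G (G^* G)^{-1} w
    return G, w, q0

rng = np.random.default_rng(2)
m, n, nu = 300, 120, 0.1
# Bouquet with an assumed flat mean (all entries m^{-1/2}).
A = np.ones((m, 1)) / np.sqrt(m) + (nu / np.sqrt(m)) * rng.standard_normal((m, n))
I = np.array([0, 5, 9])                          # k1 = 3
J = np.sort(rng.choice(m, size=150, replace=False))   # k2 = 150, i.e. rho = 0.5
sigma = rng.choice([-1.0, 1.0], size=J.size)

G, w, q0 = separator_candidate(A, I, J, sigma)
violations = int(np.sum(np.abs(q0) >= 1.0))      # entries where q0 fails ||q||_inf < 1
```

By construction $G^* q_0 = w$ always holds; whether `violations` is zero depends on the instance, which is the issue taken up next.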

⁷ In the model outlined above, this occurs with probability one for m sufficiently large.
⁸ Notice $H_q$ cannot contain $0 \in \mathrm{interior}(B_1)$, so the normalization $\langle q, v \rangle = 1$ is appropriate.
Fig. 3. Geometry for the proof of Lemma 1. The unit ball $B_\diamond$ can be separated from $H_w$ in $\mathbb{R}^n$ if and only if, in the
lifted space $\mathbb{R}^p$, the $\ell^1$-ball $B_1$ can be separated from the image of $H_w$ under the injective map G. $H_q$ is the
separating hyperplane with a normal vector q. Such an $H_q$ might not be unique in $\mathbb{R}^p$, and $q_0$ would be
the normal to the special separating hyperplane that contains $G(H_w)$.



When we use this particular normal $q_0$, we are demanding that the projection of $B_1$ onto $\mathcal{R}(G)$ lie in
$G[H_w^-]$. Since the projection contains the intersection, $B_1 \subset \{\langle q_0, \cdot \rangle < 1\}$ is a sufficient, but not necessary
condition. It is not surprising, then, that this condition often does not hold: empirically, $\|q_0\|_\infty > 1$
with high probability. However, as we will see, the set of violations is almost always small, and we can
apply a simple iterative scheme to improve $q_0$ to a valid separator q with $\|q\|_\infty < 1$.

C. Iterative Construction of Separator 

Our next lemma argues that if we are given an initial guess at a normal vector $q_0 \in \mathbb{R}^p$ whose
hyperplane $H_{q_0}$ separates $G[H_w]$ from most of the vertices of $B_1$, then we can refine $q_0$ to a q that
separates $G[H_w]$ and all of the vertices of $B_1$. In general, finding such a q requires solving a linear
programming problem. We will analyze the feasibility of this linear program by considering an iteration
similar to the alternating projection method for finding a pair of closest points between two convex sets.
In this case, the two convex sets of interest are the hypercube of radius $1 - \varepsilon$ and the affine subspace
$q_0 + \mathcal{R}(G)^\perp$.

In the following lemma, $q_0 \in \mathbb{R}^p$ is arbitrary (though $q_0 = G^{\dagger *} w$ is natural). We will construct a
sequence of vectors $(q_k)_{k=0}^\infty$. Fix a small constant $\varepsilon > 0$, and define the operator $\theta$ which takes the part
of a vector that protrudes above $1 - \varepsilon$:

$(\theta x)(i) = \begin{cases} 0, & \text{for } |x(i)| \le 1 - \varepsilon, \\ \mathrm{sgn}(x(i))\,(|x(i)| - 1 + \varepsilon), & \text{for } |x(i)| > 1 - \varepsilon. \end{cases}$ (14)

We iteratively construct $q_k$ by setting

$q_{i+1} = q_i - \pi_{\mathcal{R}(G)^\perp} \theta q_i = q_i - \theta q_i + \pi_{\mathcal{R}(G)} \theta q_i$. (15)

Notice that by construction, $G^* q_k = G^* q_0 = w$ for all k. So if $\theta q_k \to 0$, then $\|q_k\|_\infty < 1$ eventually,
and $q_k$ is a valid separator.
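The iteration (15) is simple to implement. The following is a minimal sketch, assuming G has full column rank so that a reduced QR factorization gives an orthonormal basis of $\mathcal{R}(G)$; the function names `theta` and `refine_separator`, the stopping rule, and the toy inputs are assumptions for illustration, and no claim of convergence is made for arbitrary inputs.

```python
import numpy as np

def theta(q, eps):
    """Threshold-residual operator (14): the part of each entry protruding beyond 1 - eps."""
    return np.sign(q) * np.maximum(np.abs(q) - (1.0 - eps), 0.0)

def refine_separator(G, q0, eps=0.05, max_iter=50, tol=1e-10):
    """Iterate q <- q - P_perp(theta(q)), with P_perp the projection onto R(G)^perp."""
    Q, _ = np.linalg.qr(G)                       # orthonormal basis of R(G)
    q = q0.copy()
    history = []                                 # ||theta(q_k)||_2 at each step
    for _ in range(max_iter):
        t = theta(q, eps)
        history.append(np.linalg.norm(t))
        if history[-1] <= tol:
            break
        q = q - (t - Q @ (Q.T @ t))              # remove the component of theta(q) orthogonal to R(G)
    return q, history

rng = np.random.default_rng(3)
G = rng.standard_normal((200, 20))               # toy stand-in for the matrix in (8)
q_start = 0.5 * rng.standard_normal(200)         # arbitrary initial normal
w = G.T @ q_start
q_end, history = refine_separator(G, q_start)
```

Each update subtracts a vector orthogonal to $\mathcal{R}(G)$, so $G^* q_k = w$ is preserved exactly as noted above; `history` records $\|\theta q_k\|_2$ and can be inspected for the geometric decay discussed below.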

Before proving that this iteration produces a valid separator with high probability, we first demonstrate
its behavior on a simulated example with m = 3,000, $\delta = .4$, $\nu = .1$, $\rho = .65$, and $k_1 = 10$. Figure
4 plots the sorted absolute values of entries of $q_k$. Notice that the sorted coefficients clearly divide into
two parts; these correspond to the upper⁹ ($R_1$) and lower ($R_2$) indices. The initial separator $q_0$ cleanly
separates $G[H_w]$ from most of the vertices of $B_1$: only 39 entries protrude above $1 - \varepsilon$. These entries are
quickly iterated away: $\|\theta q_k\|$ decreases geometrically until after 5 iterations a valid separator is obtained.

Lemma 2: Suppose $\exists\, c \in (0, 1)$ such that

$\xi = \sup_{\|s\|_0 \le cp,\ s \ne 0} \frac{\|\pi_{\mathcal{R}(G)} s\|_2}{\|s\|_2} < 1$, (16)

and

$\|q_0\|_2 + \frac{1}{1 - \xi} \|\theta q_0\|_2 \le (1 - \varepsilon) \sqrt{cp}$, (17)

where G is the matrix defined in (8). Iteratively construct a sequence of vectors $\{q_k\}$, with $q_k = q_{k-1} - \pi_{\mathcal{R}(G)^\perp} \theta q_{k-1}$,
where $\theta$ is the threshold-residual operator defined in (14). Then $\lim_{k \to \infty} \theta q_k = 0$.



Proof: Let $T_k = \{i \mid |q_k(i)| > 1 - \varepsilon\} \subset [p]$, and consider the following three statements:

$\|q_k\|_2 \le \|q_0\|_2 + \|\theta q_0\|_2 \sum_{j=0}^{k} \xi^j$, $\|\theta q_k\|_2 \le \|\theta q_0\|_2\, \xi^k$, $\#T_k \le cp$. (18)

We will show by induction that these statements hold for all k, establishing the lemma. The first two
statements of (18) hold trivially for k = 0. For $\#T_0$, notice that by (17),

$\#T_0 \le \frac{\|q_0\|_2^2}{(1 - \varepsilon)^2} \le cp$.

⁹ Where necessary, we will use $R_1 = \{1, \ldots, m - k_2\} \subset [p]$ to index the upper rows of G (corresponding to A), and
$R_2 = [p] \setminus R_1$ to index the lower rows.
Fig. 4. Iterative refinement producing a separating hyperplane. Here, m = 3000, $\delta = .4$, $\nu = .1$, $\rho = .65$, $k_1 = 10$. We plot
the sorted magnitudes of the entries of $q_k$ against the sorted coefficient indices. At left, $q_0$ separates $G(H_w)$ from most of the vertices of $B_1$:
only 39 violations occur. The distinct bimodal characteristic of $q_0$ is due to the differences between
the statistics of the top ($R_1$) and bottom ($R_2$) indices. Applying the iteration decreases $\|\theta q_k\|$
geometrically; after 5 iterations, a valid separator is obtained. [Panel annotation, iteration 2: $\#\{i \mid |q_2(i)| > 1\} = 38$, $\|\theta q_2\|_2 = 0.006$.]



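Now, suppose the three statements hold for $0, \dots, k$. Since $\theta q_k$ has the same signs and smaller magnitude than $q_k$, $\|q_k - \theta q_k\|_2 \le \|q_k\|_2$; combining this with the inductive hypothesis we have

$$\|q_{k+1}\|_2 = \|q_k - \theta q_k + \pi_{\mathcal{R}(G)}\theta q_k\|_2 \le \|q_k - \theta q_k\|_2 + \|\pi_{\mathcal{R}(G)}\theta q_k\|_2 \le \|q_k\|_2 + \xi^{k+1}\|\theta q_0\|_2 \le \|q_0\|_2 + \|\theta q_0\|_2 \sum_{j=0}^{k+1} \xi^j.$$

Similarly, notice that since $\pi_{\mathcal{R}(G)}\theta q_k$ dominates $\theta(q_k - \theta q_k + \pi_{\mathcal{R}(G)}\theta q_k)$ elementwise,

$$\|\theta q_{k+1}\|_2 \le \|\pi_{\mathcal{R}(G)}\theta q_k\|_2 \le \xi\,\|\theta q_k\|_2 \le \xi^{k+1}\|\theta q_0\|_2.$$

Finally, for the sparsity result $\#T_{k+1} \le cp$, note that

$$\|q_{k+1}\|_2 \le \|q_0\|_2 + \|\theta q_0\|_2 \sum_{j=0}^{k+1} \xi^j \le \|q_0\|_2 + \frac{1}{1-\xi}\,\|\theta q_0\|_2 \le (1-\varepsilon)\sqrt{cp},$$

and so $\theta q_{k+1}$ must be $(cp)$-sparse. Since (18) holds for all $k$, $\|\theta q_k\|_2 \to 0$.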
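The iteration in Lemma 2 is easy to try numerically. The following is a minimal sketch, not the authors' implementation: it assumes that the threshold-residual operator acts entrywise as $\theta(q)_i = \mathrm{sign}(q_i)\max(|q_i| - (1-\varepsilon), 0)$, and it uses a small random matrix `G`, a random `w`, and hypothetical sizes purely for illustration. Two properties hold for any full-column-rank `G`: the constraint $G^* q_k = w$ is preserved, and $\|\theta q_k\|_2$ never increases. Convergence to zero additionally needs conditions (16) and (17).

```python
import numpy as np

def theta(q, eps):
    """Threshold-residual operator (assumed form): keep only the part of each
    entry that protrudes above 1 - eps in absolute value."""
    return np.sign(q) * np.maximum(np.abs(q) - (1.0 - eps), 0.0)

def refine_separator(G, w, eps=0.05, max_iter=50, tol=1e-10):
    """Iterate q_k = q_{k-1} - P_perp(theta(q_{k-1})), where P_perp projects
    onto the orthogonal complement of range(G)."""
    basis, _ = np.linalg.qr(G)                    # orthonormal basis of range(G)
    q = np.linalg.lstsq(G.T, w, rcond=None)[0]    # minimum 2-norm solution of G^T q = w
    history = [np.linalg.norm(theta(q, eps))]
    for _ in range(max_iter):
        if history[-1] <= tol:
            break
        r = theta(q, eps)
        q = q - (r - basis @ (basis.T @ r))       # subtract the component of r outside range(G)
        history.append(np.linalg.norm(theta(q, eps)))
    return q, history

def demo(seed=0):
    rng = np.random.default_rng(seed)
    p, r = 200, 20                                # hypothetical sizes, for illustration only
    G = rng.standard_normal((p, r)) / np.sqrt(p)
    w = rng.standard_normal(r)
    q, history = refine_separator(G, w)
    return G, w, q, history

G, w, q, history = demo()
```

Here `history` records $\|\theta q_k\|_2$ at each step, so its decay can be compared with the geometric rate $\xi^k$ predicted by (18).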
D. Putting it All Together

By Lemmas 1 and 2, if the two conditions (16) and (17) hold for a given sign and support triplet $(I, J, \sigma)$, then $(I, J, \sigma)$ is $\ell^1$-recoverable.¹⁰ We will show that as $m \to \infty$, for any sequence of signal supports $I$, (16) and (17) hold with probability approaching one in the random matrix $A$ and error $(J, \sigma)$. The probability that either condition fails for a given $I$ will be small enough to allow a union bound over all $I$, establishing Theorem 1. We will assume we are in the large error regime, with $\bar\rho = 1 - \rho$ lower bounded as specified in the lemmas below. The conclusion still follows for smaller error fractions, since whenever $(I, J, \sigma)$ is $\ell^1$-recoverable, so is $(I, J', \sigma_{J'})$ for any $J' \subset J$.

In this section, we lay out the main ideas for the rest of the proof, which consists of two parts, one for each of the conditions in Lemma 2. We establish that the following two properties hold simultaneously with probability at least $1 - e^{-Cm^{1-\eta_0/2}(1+o(1))}$:

1) For a small enough constant $c$, the projection ratio $\xi$ for $cm$-sparse signals onto $\mathcal{R}(G)$ is bounded below 1 by a polynomial function in $\nu$. More precisely, $\xi \le 1 - C\nu^8$ for some constant $C > 0$. As a result, the coefficient $\frac{1}{1-\xi}$ in the second condition (17) is bounded by $C^{-1}\nu^{-8}$.

2) As $m$ goes to infinity, the $\ell^2$-norm of the initial separating normal vector $\|q_0\|_2$ is bounded above by $\nu\,O(m^{1/2})$, and $\|\theta q_0\|_2$ is bounded above by $e^{-a/\nu^2}\,O(m^{1/2})$ for some constant $a$.

Putting these results together, the initial separating normal vector $q_0$ satisfies:

$$\|q_0\|_2 + \frac{1}{1-\xi}\,\|\theta q_0\|_2 \le \nu\,O(m^{1/2}) + C^{-1}\nu^{-8} e^{-a/\nu^2}\,O(m^{1/2}). \tag{19}$$

If the deviation $\nu$ of the bouquet is small enough, the second condition (17) of Lemma 2 will be satisfied, since the right hand side, $(1-\varepsilon)\sqrt{cp} = \Omega(m^{1/2})$, is independent of $\nu$. Hence, by Lemma 2, the iterates started from the initial normal $q_0$ will converge to a valid normal vector that separates the $\ell^1$-ball $B_1$ from the subspace $G[H_w]$, establishing $\ell^1$-recoverability at $(I, J, \sigma)$. Comparing the failure probability for the two conditions to the number of subsets $I \subset [n]$ of size $C_0 m^{1-\eta_0}$ then completes the proof of Theorem 1. These arguments are laid out more precisely and quantitatively in Section C of the appendix.

Whereas Lemmas 1 and 2 have simple geometric and algebraic proofs, the above results require more detailed analysis of large Gaussian matrices. We outline the main ideas of their proof in this section, leaving many of the technical details to the appendix. The derivation is based on recent (and now widely-used) results on concentration of Lipschitz functions [3], which state that if $x$ is a $d$-dimensional iid $\mathcal{N}(0, 1)$ vector and $f : \mathbb{R}^d \to \mathbb{R}$ is 1-Lipschitz, then

$$P\big[\,|f(x) - \mathbb{E}f(x)| \ge t\,\big] \le 2\exp\left(-\frac{2t^2}{\pi^2}\right). \tag{20}$$

¹⁰ Notice that conditions (16) and (17) depend on $(I, J, \sigma)$, through the construction of the matrix $G$.

Two cases are of particular interest here. First, the norm concentrates as (see, e.g., [27]):

$$P\big[\,\|x\|_2 \ge \beta\sqrt{d}\,\big] \le \exp\left(-\frac{2(\beta-1)^2 d}{\pi^2}\right). \tag{21}$$

Second, as has been widely exploited in the compressed sensing literature (e.g., [14], [16]), the singular values of rectangular Gaussian matrices with aspect ratio $\alpha$ concentrate about the values $1 \pm \sqrt{\alpha}$ predicted by the Marchenko-Pastur law:

Fact 1 (Concentration of singular values [3]): Let $A \in \mathbb{R}^{m \times n}$ ($m > n$) be a random matrix with entries iid $\mathcal{N}(0, \frac{1}{m})$. Then for any $t > 0$,

$$P\left[\sigma_{\max}(A) \ge 1 + \sqrt{n/m} + o(1) + t\right] \le e^{-mt^2/2}, \tag{22}$$

$$P\left[\sigma_{\min}(A) \le 1 - \sqrt{n/m} + o(1) - t\right] \le e^{-mt^2/2}. \tag{23}$$

We will also return to (20) in the proof of Lemma 8 of the appendix.
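Both concentration statements can be checked with a quick Monte Carlo experiment. The sketch below is illustrative only: the sizes, the trial count and the value of $\beta$ are arbitrary choices, and the tail bound is evaluated with the constant $2/\pi^2$ in the exponent, which is an assumption about the form of the norm bound (the sharper standard Gaussian constant $1/2$ would only make the bound smaller).

```python
import numpy as np

def norm_tail(d, beta, trials, rng):
    """Empirical estimate of P[ ||x||_2 >= beta * sqrt(d) ] for x iid N(0, 1)."""
    x = rng.standard_normal((trials, d))
    return np.mean(np.linalg.norm(x, axis=1) >= beta * np.sqrt(d))

def extreme_singular_values(m, n, rng):
    """Largest and smallest singular values of an m x n matrix with iid N(0, 1/m) entries."""
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    s = np.linalg.svd(A, compute_uv=False)
    return s.max(), s.min()

rng = np.random.default_rng(0)

d, beta = 50, 1.2                                    # illustrative values
empirical = norm_tail(d, beta, trials=20000, rng=rng)
bound = np.exp(-2.0 * (beta - 1.0) ** 2 * d / np.pi ** 2)

m, n = 2000, 500                                     # aspect ratio alpha = n / m = 0.25
smax, smin = extreme_singular_values(m, n, rng)
alpha = n / m
print(empirical, bound)
print(smax, 1 + np.sqrt(alpha), smin, 1 - np.sqrt(alpha))
```

For these sizes the extreme singular values land close to $1 \pm \sqrt{\alpha}$, which is the behavior that Fact 1 quantifies.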

1) Projection of Sparse Vectors: In this subsection, we upper bound the norm of the projection of any sparse vector onto $\mathcal{R}(G)$. Since the lower ($R_2$) coordinates of

$$G = \begin{bmatrix} A_1 & A_2 \\ 0 & I \end{bmatrix}$$

contain an identity matrix, when the variance $\nu^2/m$ of the perturbations $Z_1, Z_2$ is small, we expect that sparse vectors with support on $R_2$ will be very close to $\mathcal{R}(G)$. The following lemma verifies that this is the case, but argues that the distance to $\mathcal{R}(G)$ is at least $\Omega(\nu^2)$. The technical conditions appear complicated, but simply assert that the fraction of nonzeros $c$ is sufficiently small.

Lemma 3 (Projection of Sparse Vectors): Suppose that p < S and v < min (g, (512/(5)^/'^), 



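$$c < \min\left(\frac{\bar\rho}{1024},\ \frac{\bar\rho}{64\,(1 + 2C_\mu \bar\rho^{-1/2})^2}\right), \qquad \bar\rho\, H(c/\bar\rho) + \delta\, H(c/\delta) < \frac{\bar\rho}{128\pi^2}, \tag{24}$$

where $H(\cdot)$ is the base-$e$ binary entropy function. Then the projection of a sparse vector $s \in \mathbb{R}^p$ with $\|s\|_0 \le cm$ onto the range of $G$ is bounded as

$$\sup_{\|s\|_0 \le cm,\ s \ne 0} \frac{\|\pi_{\mathcal{R}(G)}\, s\|_2}{\|s\|_2} \le 1 - \frac{\nu^8\,(\sqrt{\delta} - \sqrt{\bar\rho})^4}{\big(32 + 128\nu^2(\sqrt{\delta} + \sqrt{\bar\rho})^2\big)^4} \tag{25}$$

on the complement of a bad event with probability $e^{-Cm(1+o(1))}$.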
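Proof: The projection of $s = \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}$ onto $\mathcal{R}(G)$ solves

$$\min_{u_1, u_2} \left\| \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} - G \begin{bmatrix} u_1 \\ s_2 + u_2 \end{bmatrix} \right\|_2^2 = \min_{u_1, u_2}\ \|s_1 - A_1 u_1 - A_2(s_2 + u_2)\|_2^2 + \|u_2\|_2^2.$$

By minimizing the first term, we can write the unique optimal $u_1$ in terms of the remaining variables:

$$u_1 = (A_1^* A_1)^{-1} A_1^* s_1 - (A_1^* A_1)^{-1} A_1^* A_2 (s_2 + u_2),$$

and subsequently, the optimal $u_2$ satisfies:

$$-A_2^* s_1 + A_2^* A_1 u_1 + A_2^* A_2 (s_2 + u_2) + u_2 = 0 \implies \left(I + A_2^* \pi_{A_1^\perp} A_2\right) u_2 = A_2^* \pi_{A_1^\perp} s_1 - A_2^* \pi_{A_1^\perp} A_2\, s_2,$$

where $\pi_{A_1^\perp}$ denotes the projection matrix onto the orthogonal complement of $\mathcal{R}(A_1)$.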
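The closed-form expressions for $u_1$ and $u_2$ can be sanity-checked against a generic least-squares projection. The sketch below assumes the block structure $G = [A_1\ A_2;\ 0\ I]$ and uses small random matrices with hypothetical sizes (`r1` rows, `k1` and `n2` columns); it illustrates the algebra and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
r1, k1, n2 = 10, 2, 25            # hypothetical sizes: rows of A, columns of A1 and A2
A1 = rng.standard_normal((r1, k1))
A2 = rng.standard_normal((r1, n2))
G = np.block([[A1, A2], [np.zeros((n2, k1)), np.eye(n2)]])

s1 = rng.standard_normal(r1)
s2 = rng.standard_normal(n2)
s = np.concatenate([s1, s2])

# Projector onto the orthogonal complement of range(A1).
P_perp = np.eye(r1) - A1 @ np.linalg.solve(A1.T @ A1, A1.T)

# u2 from (I + A2^T P_perp A2) u2 = A2^T P_perp (s1 - A2 s2), then u1 from its closed form.
u2 = np.linalg.solve(np.eye(n2) + A2.T @ P_perp @ A2, A2.T @ P_perp @ (s1 - A2 @ s2))
u1 = np.linalg.solve(A1.T @ A1, A1.T @ (s1 - A2 @ (s2 + u2)))
proj_closed = G @ np.concatenate([u1, s2 + u2])

# Reference: projection of s onto range(G) computed by least squares.
proj_ref = G @ np.linalg.lstsq(G, s, rcond=None)[0]

residual = s - proj_closed        # its lower block equals -u2
```

Because the lower block of the residual is exactly $-u_2$, the norm of the component of $s$ orthogonal to $\mathcal{R}(G)$ is at least $\|u_2\|_2$, which is the inequality used next in the proof.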

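Write $A_2^* \pi_{A_1^\perp} = U S V^*$ with $U \in \mathbb{R}^{(\delta m - k_1) \times (\bar\rho m - k_1)}$, $V \in \mathbb{R}^{\bar\rho m \times (\bar\rho m - k_1)}$ matrices with orthonormal columns, and the diagonal of $S \in \mathbb{R}^{(\bar\rho m - k_1) \times (\bar\rho m - k_1)}$ containing the nonzero singular values of $A_2^* \pi_{A_1^\perp}$. Then if $u_2$ is the solution to the above equation,

$$\|\pi_{\mathcal{R}(G)^\perp}\, s\|_2 \ge \|u_2\|_2 = \left\| (S^2 + I)^{-1} S V^* \begin{bmatrix} I & -A_2 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} \right\|_2 = \left\| (S^2 + I)^{-1} S \begin{bmatrix} V^* & -S U^* \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} \right\|_2. \tag{26}$$

Above is the norm of the product of a diagonal matrix $(S^2 + I)^{-1} S$, a wide matrix $[\,V^* \ \ {-S U^*}\,]$, and a sparse vector $s$. We will bound it by lower bounding the elements of the diagonal matrix, and then lower bounding the "restricted minimum singular value"

$$\gamma_{cm}\big([\,V^* \ \ {-S U^*}\,]\big) = \inf_{\|s\|_0 \le cm,\ s \ne 0} \frac{\big\|[\,V^* \ \ {-S U^*}\,]\, s\big\|_2}{\|s\|_2}.$$

We first drop the top row of $(S^2 + I)^{-1} S [\,V^* \ \ {-S U^*}\,]$. This allows us to uniformly lower bound the diagonal of $(S^2 + I)^{-1} S$. While $\sigma_1$ can be quite large due to the inhomogeneous term ($\mu_{J^c} \mathbf{1}^*$), and hence $\frac{\sigma_1}{1 + \sigma_1^2}$ can be quite small, for the remaining singular values $\frac{\sigma_i}{1 + \sigma_i^2}$ is at least on the order of $\nu$. Let $\tilde S \in \mathbb{R}^{(\bar\rho m - k_1 - 1) \times (\bar\rho m - k_1 - 1)}$ be the diagonal matrix obtained by dropping the row and column of $S$ corresponding to the largest singular value; $\tilde V$ and $\tilde U$ are obtained by dropping the corresponding columns. From (26),

$$\|u_2\|_2 \ge \left\| (\tilde S^2 + I)^{-1} \tilde S \begin{bmatrix} \tilde V^* & -\tilde S \tilde U^* \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} \right\|_2 \ge \frac{\sigma_{\min}(A_2^* \pi_{A_1^\perp})}{1 + \sigma_2^2(A_2^* \pi_{A_1^\perp})}\ \gamma_{cm}\big([\,\tilde V^* \ \ {-\tilde S \tilde U^*}\,]\big)\, \|s\|_2, \tag{27}$$

where $\sigma_{\min}(A_2^* \pi_{A_1^\perp})$ is the smallest nonzero singular value and $\sigma_2(A_2^* \pi_{A_1^\perp})$ is the second largest singular value.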
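For intuition, the restricted minimum singular value can be computed by brute force on tiny examples: the infimum over $k$-sparse vectors equals the smallest singular value over all $k$-column submatrices. The helper below is a hypothetical illustration (its name and interface are not from the paper), and its cost grows combinatorially, so it is only usable for very small matrices.

```python
import itertools
import numpy as np

def restricted_min_singular_value(M, k):
    """Brute-force gamma_k(M): the minimum of ||M s||_2 / ||s||_2 over nonzero
    vectors s with at most k nonzero entries."""
    rows, cols = M.shape
    k = min(k, cols)
    if k > rows:
        return 0.0                      # any k columns are linearly dependent
    best = np.inf
    for support in itertools.combinations(range(cols), k):
        sv = np.linalg.svd(M[:, list(support)], compute_uv=False)
        best = min(best, sv[-1])        # singular values are sorted in decreasing order
    return float(best)

M = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
gamma_1 = restricted_min_singular_value(M, 1)   # every column has unit norm
gamma_2 = restricted_min_singular_value(M, 2)   # columns 0 and 1 coincide
```

In this toy matrix, 1-sparse vectors are never shrunk, while the 2-sparse vector $(1, -1, 0)$ is annihilated, so the restricted value drops to zero once $k = 2$.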

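a) Bounding the second largest singular value $\sigma_2(A_2^* \pi_{A_1^\perp})$: Write $\tilde\mu = \pi_{A_1^\perp} \mu_{J^c}$, and notice that

$$\sigma_2(A_2^* \pi_{A_1^\perp}) = \inf_{u} \sup_{v} \frac{\|A_2^* \pi_{A_1^\perp} \pi_{u^\perp} v\|_2}{\|v\|_2} = \inf_{u}\ \sigma_1(A_2^* \pi_{A_1^\perp} \pi_{u^\perp}) \le \sigma_1(A_2^* \pi_{A_1^\perp} \pi_{\tilde\mu^\perp}) = \sigma_1\big(Z_2^*\, \pi_{(\mathcal{R}(Z_1) + \mathcal{R}(\mu_{J^c}))^\perp}\big).$$

Choose any orthonormal basis for the subspace $\Sigma = (\mathcal{R}(Z_1) + \mathcal{R}(\mu_{J^c}))^\perp$. Since $\Sigma$ is probabilistically independent of $Z_2$, the representation of the projection $Z_2^* \pi_\Sigma$ with respect to the chosen basis is simply distributed as a $(\delta m - k_1) \times (\bar\rho m - k_1 - 1)$ random matrix $\tilde Z_2$ with entries $\mathcal{N}(0, \nu^2/m)$. Since $\frac{\sqrt{m}}{\nu\sqrt{\delta m - k_1}}\, \tilde Z_2$ is $\mathcal{N}(0, \frac{1}{\delta m - k_1})$, by Fact 1

$$P\left[\sigma_1\!\left(\frac{\sqrt{m}}{\nu\sqrt{\delta m - k_1}}\, \tilde Z_2\right) \ge 1 + \sqrt{\frac{\bar\rho m - k_1 - 1}{\delta m - k_1}} + t\right] \le \exp\left(-\frac{(t - o(1))^2 (\delta m - k_1)}{2}\right), \tag{28}$$

and so $P\left[\sigma_1(\tilde Z_2) \ge 2\nu(\sqrt{\delta} + \sqrt{\bar\rho})\right] \le e^{-Cm(1+o(1))}$. On the complement of this bad event, $\sigma_2^2(A_2^* \pi_{A_1^\perp}) \le 4\nu^2(\sqrt{\delta} + \sqrt{\bar\rho})^2$.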

b) Bounding the smallest nonzero singular value amin{-^2'^A^) — i^^tceAj^ ^^Jilf!^"' ^ ^ 
-^pmx{pm-ki) ^ matrix whose columns form an orthonormal basis for Aj^, and let Q G -\^{Sm-ki)x{Sm-ki 
be an orthonormal basis for 'i-j'^n^f.^- Then amin{A2irj^±) > (Jmin{Q*A2W). Conditioned on Ai, Z'2 = 
Q*A2W G M{'5m-fci-i)x{pm-fcO AA(0, i^V"i)- Applying Fact[l] (with a similar rescaling argument 
to the one used for amax{,Z2) above) gives that 



P 



[Z'2)<\(f^-rp 



< g-Cm(l+o(l))^ 



(29) 



On the complement of this bad event, ^^^(Agvry^-L) > \{\fb — ^/p)- 

Finally, in Lemma [5] of Appendix |Aj we show that under the stated conditions, the restricted singular 
value 7cm in ([27]) satisfies 7c.m( [V* -SU*\ ) > ^ with probability at least 1 - e-<^'"(i+°(i)). Notice 
that this bound agrees with (and in fact is looser than) the Marchenko-Pasteur law for a pm x cm Gaussian 
AA(0, zv^/m) matrix (i.e., the concentration result of Factjl]). In fact, the proof argues that the two blocks 
of this matrix are probabilistically independent, and then applies Fact [T] to an equivalent pair of Gaussian 



matrices. The somewhat technical conditions ( [24| ) introduced here are necessary to ensure that a union 
bound over all subsets of cm columns remains small. 

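Combining the three results, we have that for all $s \in \mathbb{R}^p$ with $\|s\|_0 \le cm$,

$$\frac{\|\pi_{\mathcal{R}(G)^\perp}\, s\|_2}{\|s\|_2} \ge \frac{\nu^2\,(\sqrt{\delta} - \sqrt{\bar\rho})}{32 + 128\nu^2(\sqrt{\delta} + \sqrt{\bar\rho})^2}. \tag{30}$$

Writing $\beta$ for the right hand side of (30), notice that

$$\frac{\|\pi_{\mathcal{R}(G)}\, s\|_2}{\|s\|_2} = \sqrt{1 - \left(\frac{\|\pi_{\mathcal{R}(G)^\perp}\, s\|_2}{\|s\|_2}\right)^2} \le \sqrt{1 - \beta^2} \le 1 - \beta^4,$$

where we have used that $1 - \beta^4 \ge \sqrt{1 - \beta^2}$ for $\beta \le 1/\sqrt{2}$; this is guaranteed for $\nu < (512/\delta)^{1/4}$. Combined with (30), this implies (25). ■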

2) Initial Separating Hyperplane: In this section, we analyze the initial separator $q_0$, obtained as the minimum 2-norm solution to the equation $G^* q = w$. We upper bound both $\|q_0\|_2$ and $\|\theta q_0\|_2$, where the operator $\theta$ defined in (14) retains the portion of a vector that protrudes above $1 - \varepsilon$ in absolute value. These bounds provide the second half of the conditions needed in Lemma 2 to show that $q_0$ can be refined by alternating projections to give a true separator.

Lemma 4: Suppose p < 6 and u < g • Then for G defined in ([8]l and w = A*j^a — 1/, 3 

constants ai, 0.2 such that Qq = G'^*w satisfies 

Ikolb < axvm^l'^ + o(mi/2)^ (31) 
ll^golb < a2exp(^-^^ ^ o(^i/2). (32) 

on the complement of a bad event of probability < e~*^"^^ "0/2(1+0(1)) 

Proof: Notice that Gt* = G{G*G)-^ = [^^ {G*GY^ + {G*GY^, where Zi = Zj.^i 

and Z2 = Zja ja. Expanding qg = gives 



+ ['^(j^] (l*(G*G)-^Z},cr - + (/xj,cr) (33) 



In this section, we concentrate our efforts on the first term above. In Lemma |7] of Appendix |BJ we give a 
more detailed analysis of {G*G)^^, which shows that the remaining terms are all negligible, contributing 
o(m^/^) to HqoII- This is essentially due to the presence of a large common term in the columns of 
G: the most significant term in G*G is iJ.*j^^ijaH* , and shrinks 1. More precisely. Lemma [v] 

of Appendix B shows that with probability at least 1 — e"*"™' "0/^(1+0(1))^ 

II ^i] {G*G)-'^Zl,a\\ < Gm^l'^-^°l\ 

This remaining term can be further simplified by splitting out several of the inhomogeneous parts of 
Define Q = Z*j,^Zjc^, + [[)?] = [flfj zfif+il ^ C = G K"- In terms 

of these variables, G*G = Q + CI* + IC* + all*. Applying the matrix inversion lemma, 

{G*Gr^ = Q-^ - Q-^/^MEM*Q-^/^, (34) 

where M = y^-*: Q G M"^^, and E is an appropriate 2x2 matrix. Since 1? = 

. iiv 11I2 ii<y '"CII2 J 

Zj^cr G M" is iid AA(0, i^^p) independent of G, with high probability it is almost orthogonal to the 
rank-2 perturbation T = Q~^/'^MEM*Q-^/^: P [\\ttt'&\\ > m^/^-'^i/^] - e-^™'"'"^']^ Using Fact [l] 
and block singular value identities, it is not difficult to showj^ that < ^ with probability at 

"||7rri?|| is distributed as the norm of a 2-dimensional jV{0,u^p) vector. The bound follows from the x tail bound l |21| ). 
'^Use that a^i„ ([^1 ^^^]) > cr^,„(Zi) - ff^ii "^^^) and apply Factjljto bound each term. 
least $1 - e^{-Cm(1+o(1))}$. Combined with the bound $\|(G^*G)^{-1}\| \le C_G$ from Lemma 7, we have that
||r|| < ||(G*G)-i|| + ||Q-i|| < Cg + ^ is bounded by a constant, and 

IPo^t]r^ll < IPo^^i^ll lirillKr^ll < (l + 2i.2(V^ + ^)2y/'(^CG + ^) mi/2-W4 
and the remaining part of Qq is 

The first two terms involve projections of onto fci -dimensional subspaces, and hence are of lower 
order. That is, for S = null([Q-i]/,,)"^' we have P [Wn^-^h > mi/2-^o/4j ^ g-Cm^-^o/^^ ^^^^^ ||^^|| 
and IIQ^^II are bounded by constants with overwhelming probability, with probability at least 1 — 
[^q] [Q~^]f,''^ ^ C'vn}/'^~'^°^^ . Identical reasoning shows that on the comple- 



-Cmi-''o/2(l+o(l)) 



ment of a bad event of probability x e 



This leaves [ j ij Expressing Q as [ y ] and applying the Schur complement formula 

gives [Q"^]/c = w-^ + W^W{U-^ -V*W^'^V)-^V*W-^ , where W = Z^Zs + I, V = Z^Zi, and 
U = ZlZi. Because W -^1, \\W~^\\ < 1. With probability at least 1 - e-C'"(i+°(i)), \\U\\ = \\Zi\\^ < 
2v'^p, (Jmin{U) > and ||F|| < ||Zi||||Z2|| < 2 i/^ + and so 



fTmm(C/-l) - ||V^P||l^-l|| - l-8zy6(l + ^)2 

is bounded by a constant. Let S' denote the fci -dimensional range of this matrix. With probabiUty 

> 1 _ e-Cm-™/^(i+o(i))^ ||^^,^|| < ^i/2-„o/4^ and so 

\\[\-]W-^V{U-^ -V*W-^Vy'^V*W-^'d\\ < C""m^/2-r,o/4^ 

leaving only Qo = [f] (Z|Z2+l)-^i9/c. With probability at least l-e-^'"(i+°(^)), ||i9/c|| < V2u^/^m^/'^, 
and so 



llQolb < ||[t]|| ll^^^ll ^ ^1 + 11^2111 < i/^2 5p(^l + 2i/2 + V^)^^ (35) 

establishing the first part of the lemma. 

For the second part, we will show that the upper ($R_1$) and lower ($R_2$) parts of $q_0$ can be bounded elementwise by a pair of iid Gaussian vectors. Since for each of these vectors, the Lipschitz function $\|\theta\, \cdot\|$ is concentrated about its (very small) expectation, the desired result follows. For the upper block, write $Z_2 = QR$, where $Q \in \mathbb{R}^{\bar\rho m \times \bar\rho m}$ is an orthogonal matrix, and $R \in \mathbb{R}^{\bar\rho m \times (\delta m - k_1)}$ an upper-triangular matrix with non-negative elements on the diagonal. With probability one (as long as $\mathrm{rank}(Z_2) = \bar\rho m$), $Q$ and $R$ are uniquely determined by $Z_2$. Moreover, $Q$ is a uniform random
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= QR where Ri G Rpmxpm ^^^^^ 



orthogonal matrix, probabiUstically independent of R^ Since QoiRi) = QR{R* R + l)~^^i<^ is the 
product of a uniform random orthogonal matrix and an independent vector R{R* R + l)^^'dia, |||»|^|||| is 
uniformly distributed on S^^™-!. With probability > l-e-^"^(i+°(i)), ||qo(^i)ll = ||2'2(Z|Z2 + l)i?/o || < 
ll-^2|||Knuii(Z2(Z2*Z2+i)-i)^''^/''ll ^ 2i/^-y/p(y^+ v^) m^/^p] Introduce an independent random variable 
Ai distributed as the norm of a (/)m) -dimensional iid AA(0,o"^) vector with a = 4u'^{^/p+ V5) (i.e., an 
appropriately scaled xpm rv), and define 

, • , go(-^l) .o^^ 

lko(^i)ll 

Since 4>i is the product of a uniform random unit vector and an appropriate x random variable, its 
distribution is iid Af{0,a^). With probability 1 - e-'^™(i+°(i)), ||0i|| > f^/prn > ||qo(iii)||, so 0^ 
dominates qQ{Ri) elementwise and 116*0^11 > ||6'qo(-^i) II- Applying Lemma [S] of Appendix [b| with 
probability 1 - e-<^™(i+°(i)), 

Il^0ill2 < 4exp(-^) = 4V?exp(-^^^-j^l— < 4V^exp(-^). 

(37) 

For the lower {R2) coordinates, write Z| = [Qi Q2] 



triangular matrix with nonnegative diagonal elements, Qi is an orthogonal matrix, and Q2 is a random 
orthobasis for TZ{Qi)-^ (so that Q £ ]^in-ki)x{n-ki) orthogonal matrix). Again from the rotational 
invariance of the Gaussian distribution, Q is a uniform random orthogonal matrix, independent of R, and 

qo(i?2) = (^2^2 + = Q{RR* + iy^Q*^i^ = Q{RR* + (38) 

where 7 = Q*'dia is an iid J\f{^,v'^p) random vector, independent of Q. Hence, 90(^2) is the product 
of a uniform random orthogonal matrix Q, and a probabilistically independent vector {RR* + I)^^7, 
and its orientation nl^l^^jji is a uniform random vector on S"^'^'i^^. As above, introduce an independent 
random variable A2 distributed as the norm of an (n — fci) -dimensional iid AA(0,4i/^p) random vector, 
and define 

, > 90(^2) ,^Q. 

02 = -^2 ||. x|| ■ (39) 
lko(^2)|| 

The product of an independent unit vector and (appropriately scaled) Xn-ki scalar, 02 is distributed 

as an iid AA(0,4z/^p) vector. With probability at least 1 - ^-C^i'^+oW) ^ > ^J2v^^Jn - k\, and 

¹³ This follows from the rotational invariance of the Gaussian distribution: left multiplication by an independent orthogonal matrix sampled according to the invariant measure yields an independent pair $(Q', R)$ with $Q'R = Z_2' =_d Z_2$.

'^Here, we have l[2TJ to bound the norm of the projection of 1? onto the (pm)-dimensional subspace null(Z2(^2^2 + 
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||Qo(^2)|| < ll^/<=ll ^ \f2v^\/n — k\. Therefore, 02 dominates qo(i?2) elementwise, and ||6'02ll ^ 
11^90(^2)11- By Lemma |8] 

116*02112 < 4\/5exp(-— V) ^^''^ ^ 4\/5exp(--^) m^/^ (40) 

Combining the bounds on ||6'0i|| and ||6'02|| gives the second part of the lemma. ■ 

III. Simulations and Experiments

In this section, we perform simulations verifying the conclusions of Theorem 1, and investigating the effect of various model parameters on the error correction capability of the $\ell^1$-minimization (2). In the simulations below we use the publicly available $\ell^1$-magic package [28], except for one (higher-dimensional) face recognition example, which requires a customized interior point method. Since $\ell^1$-recoverability depends only on the signs and support of $(x_0, e_0)$, in the simulations below we choose $x_0(i) \in \{0, 1\}$ and $e_0(i) \in \{-1, 0, 1\}$. We will judge an output $(\hat x, \hat e)$ to be correct if $\max(\|\hat x - x_0\|_\infty, \|\hat e - e_0\|_\infty) < 0.01$.
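The experimental protocol just described can be sketched in a few lines. This is a hedged illustration rather than the setup used for the reported results: it swaps the $\ell^1$-magic package for SciPy's `linprog` (HiGHS solver), assumes the bouquet mean is the normalized all-ones vector, and uses a deliberately small instance. The helper names `extended_l1` and `cab_instance` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def extended_l1(A, y):
    """Solve min ||x||_1 + ||e||_1 subject to y = A x + e as a linear program,
    splitting [x; e] into nonnegative positive and negative parts."""
    m, n = A.shape
    B = np.hstack([A, np.eye(m)])                       # the extended dictionary [A I]
    cost = np.ones(2 * (n + m))
    res = linprog(cost, A_eq=np.hstack([B, -B]), b_eq=y,
                  bounds=(0, None), method="highs")
    z = res.x[: n + m] - res.x[n + m:]
    return z[:n], z[n:]

def cab_instance(m, n, k1, rho, nu, rng):
    """Draw a small cross-and-bouquet style instance (mean vector is an assumption)."""
    mu = np.ones(m) / np.sqrt(m)                        # assumed bouquet mean, unit norm
    A = mu[:, None] + (nu / np.sqrt(m)) * rng.standard_normal((m, n))
    x0 = np.zeros(n)
    x0[rng.choice(n, k1, replace=False)] = 1.0          # x0(i) in {0, 1}
    e0 = np.zeros(m)
    J = rng.choice(m, int(rho * m), replace=False)
    e0[J] = rng.choice([-1.0, 1.0], size=J.size)        # e0(i) in {-1, 0, 1}
    return A, x0, e0, A @ x0 + e0

rng = np.random.default_rng(0)
A, x0, e0, y = cab_instance(m=80, n=20, k1=2, rho=0.2, nu=0.05, rng=rng)
x_hat, e_hat = extended_l1(A, y)

# Success criterion from the text: both estimates within 0.01 in the sup norm.
success = max(np.abs(x_hat - x0).max(), np.abs(e_hat - e0).max()) < 0.01
print("recovered:", success)
```

Repeating this over many random instances and error fractions $\rho$, and averaging `success`, gives one point per $\rho$ on a recovery curve of the kind plotted in the figures below.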

a) Comparison with alternative approaches: We first compare the performance of the extended $\ell^1$-minimization

$$\min \|x\|_1 + \|e\|_1 \quad \text{subject to} \quad y = Ax + e$$

to two alternative approaches. The first is the error correction approach of [14], which multiplies by a full rank matrix $B$ such that $BA = 0$,¹⁵ solves

$$\min \|e\|_1 \quad \text{subject to} \quad Be = By,$$

and then subsequently recovers $x$ from the clean system of equations $Ax = y - e$. The second is the Regularized Orthogonal Matching Pursuit (ROMP) algorithm [29], a state-of-the-art greedy method for recovering sparse signals.¹⁶ For this algorithm, we use the implementation from http://math.ucdavis.edu/~dneedell/.

For this experiment, the ambient dimension is $m = 500$; the parameters of the CAB model are $\nu = 0.05$ and $\delta = 0.25$. We fix the signal support to be $k_1 = 15$, and vary the fraction of errors from 0 to 0.95. For each error fraction, we generate 500 independent problems. Figure 5 plots the fraction of successes for each of the three algorithms, as a function of error density $\rho$. There the extended $\ell^1$-minimization is denoted "$\ell^1 - [A\ I]$" (red curve), while the alternative approach of [14] is denoted "$\ell^1 - \perp$ comp" (blue curve). Whereas both ROMP and the approach of [14] break down around 40% corruption, the extended $\ell^1$-minimization continues to succeed with high probability even beyond 60% corruption.

¹⁵ This comparison requires $n < m$, although our method is not limited to this case.

¹⁶ For the models considered here, less sophisticated greedy methods such as the standard orthogonal matching pursuit fail even for small error fractions.

Fig. 5. Comparison with alternative approaches. Here, we fix $m = 500$, $\delta = 0.25$, $\nu = 0.05$, and $k_1 = 15$, and compare three approaches to recovering the sparse signal $x_0$ from error $e_0$. The first, denoted "$\ell^1 - [A\ I]$", solves the extended $\ell^1$ minimization advocated in this paper. The second, denoted "$\ell^1 - \perp$ comp", premultiplies by the orthogonal complement of $A$, and then solves an underdetermined system of linear equations for the sparse error $e$ [14]. The final approach is the greedy Regularized Orthogonal Matching Pursuit (ROMP) [29].
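For comparison, the annihilator-based approach of [14] can be sketched as follows. This is an assumption-laden illustration: `B` is built from an orthonormal basis of the null space of $A^*$ via `scipy.linalg.null_space`, the $\ell^1$ problem is again solved with `linprog`, and the instance is a small generic Gaussian one rather than the CAB model used in the experiment. The function name `decode_by_annihilator` is hypothetical.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import linprog

def decode_by_annihilator(A, y):
    """Premultiply by B with B A = 0, solve min ||e||_1 s.t. B e = B y,
    then recover x from the cleaned system A x = y - e."""
    m = A.shape[0]
    B = null_space(A.T).T                  # rows span the orthogonal complement of range(A)
    res = linprog(np.ones(2 * m), A_eq=np.hstack([B, -B]), b_eq=B @ y,
                  bounds=(0, None), method="highs")
    e_hat = res.x[:m] - res.x[m:]
    x_hat = np.linalg.lstsq(A, y - e_hat, rcond=None)[0]
    return B, x_hat, e_hat

rng = np.random.default_rng(0)
m, n = 40, 10                              # requires n < m so that B is nontrivial
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n)
x0[:2] = 1.0
e0 = np.zeros(m)
e0[rng.choice(m, 4, replace=False)] = 1.0
y = A @ x0 + e0

B, x_hat, e_hat = decode_by_annihilator(A, y)
```

Because $B$ annihilates $A$, this formulation discards the signal entirely while estimating $e$, so it cannot exploit the sparsity of $x$; that is one way to read the gap between the two $\ell^1$ curves in Figure 5.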

b) Error correction capacity: While the previous experiment demonstrates the advantages of the extended $\ell^1$-minimization (2) for the CAB model, Theorem 1 suggests that more is true: as the dimension increases, the fraction of errors that the extended $\ell^1$-minimization can correct should approach one. We generate problem instances with $\delta = 0.25$, $\nu = 0.05$, for varying $m = 100, 200, 400, 800, 1600$. For each problem size, and for each error fraction $\rho = 0.05, 0.1, \dots, 0.95$, we generate 500 random problems, and plot the fraction of correct recoveries in Figure 6. At left, we fix $k_1 = 1$, while at right, $k_1$ grows as $k_1 = m^{1/2}$. In both cases, as $m$ increases, the fraction of errors that can be corrected also increases.

c) Varying model parameters: We next investigate the effect of varying $\delta$ (Figure 7, left) and $\nu$ (Figure 7, right). We first fix $m = 400$, $\nu = .3$, and consider different bouquet sizes $n = 100, 200, 300, 400, 500$. Figure 7 (left) plots the fraction of correct trials for varying error densities $\rho$, for each of these bouquet sizes. For this fixed $m$, the error correction capability decreases only slightly as $n$ increases.

Fig. 6. Error correction in weak proportional growth. We fix $\delta = 0.25$, $\nu = 0.05$, and plot the fraction of successful recoveries as a function of the error density $\rho$, for each $m = 100, 200, 400, 800, 1600$. At left, $k_1$ is fixed at 1; at right, $k_1 = m^{1/2}$. In both cases, as $m$ increases, the fraction of errors that can be corrected approaches 1. (Horizontal axis in both panels: fraction corrupted, $\rho$.)

We next fix $m = 400$, $n = 200$, and consider the effect of varying $\nu$. Figure 7 (right) plots the result for $\nu = .1, .3, .5, .7, .9$. Notice that as $\nu$ decreases (i.e., the bouquet becomes tighter), the error correction capacity increases: for any fixed fraction of successful trials, the fraction of error that can be corrected increases by approximately 15% as $\nu$ decreases from .9 to .5.

d) Phase transition in total proportional growth: Theorem 1 does not provide any explicit information about the behavior of $\ell^1$-minimization when the signal support $k_1$ grows proportionally to $m$: $k_1/m \to \rho_1 \in (0, 1)$. Based on intuition from more homogeneous polytopes (especially the work of Donoho and Tanner on Gaussian matrices [24]), we might expect that when $k_1$ also exhibits proportional growth, an asymptotically sharp phase transition between guaranteed recovery and guaranteed failure will occur at some critical error fraction $\rho^* \in (0, 1)$. We investigate this empirically here by again setting $\delta = 0.25$, $\nu = 0.05$, but this time allowing $k_1 = 0.05m$. Figure 8 plots the fraction of correct recovery for varying error fractions $\rho$, as $m$ grows: $m = 100, 200, 400, 800, 1600$. In this proportional growth setting, we see an increasingly sharp phase transition, near $\rho = 0.6$.

Fig. 7. Effect of varying $n$ and $\nu$. At left, we fix $m = 400$, $\nu = .3$, and consider varying $n = 100, 200, \dots, 500$. For each of these model settings, we plot the fraction of correct recoveries as a function of the fraction of errors. Notice that the error correction capacity decreases only slightly as $n$ increases. At right, we fix $m = 400$, $n = 200$, and vary $\nu$ from .1 to .9. Again, we plot the fraction of correct recoveries for each error fraction. As expected from Theorem 1, as $\nu$ decreases, the error correction capacity of $\ell^1$-minimization increases. (Horizontal axis in both panels: fraction corrupted, $\rho$.)



e) Error correction with real face images: Finally, we return to the motivating example of face recognition under varying illumination and random corruption. For this experiment, we use the Extended Yale B face database [15], which tests illumination sensitivity of face recognition algorithms. As in [11], we form the matrix $A$ from images in Subsets 1 and 2, which contain mild-to-moderate illumination variations. Each column of the matrix $A$ is a $w \times h$ face image, stacked as a vector in $\mathbb{R}^m$ ($m = w \times h$). Here, the weak proportional growth setting corresponds to the case when the total number of image pixels grows proportionally to the number $n$ of face images. Since the number of images per subject is fixed, this is the same as the total image resolution growing proportionally to the number of subjects. We vary the image resolutions through the range $34 \times 30$, $48 \times 42$, $68 \times 60$, $96 \times 84$.¹⁷ The matrix $A$ is formed from images of 4, 9, 19, 38 subjects, respectively, corresponding to $\delta \approx 0.09$. Here, $\nu \approx 0.3$. In face recognition, the sublinear growth of $\|x_0\|_0$ comes from the fact that the observation should ideally be a linear combination of only images of the same subject. Various estimates of the required number of images, $k_1$, appear in the literature, ranging from 5 to 9. Here, we fix $k_1 = 7$, and generate the (clean) test image synthetically as a linear combination of $k_1$ training images from a single subject. The reason for using synthetic linear combinations as opposed to real test images is simply that it allows us to verify whether $x_0$ was correctly recovered; in the real data experiments of the introduction of this paper and of [11], success could only be judged in terms of the recognition rate of the entire classification pipeline.

For each resolution considered, and for each error fraction, we generate 75 trials. Figure 9 (left) plots the fraction of successes as a function of the fraction of corruption. Notice that as predicted by Theorem 1, the fraction of errors that can be corrected again approaches 1 as the data size increases. Figure 9 (right) gives a visual demonstration of the algorithm's capability. In the test images in Figure 9 (right, top), the amount of corruption is chosen to correspond to a 50% probability of success according to the plots in Figure 9 (left). Below each corrupted test image, the "clean" image recovered by our method is shown.

¹⁷ Thus, the total dimension $m = 1020, 2016, 4080, 8064$ grows roughly by a factor of 2 from one curve to the next, similar to the simulations above.

Fig. 8. Phase transition in total proportional growth. When the signal support grows in proportion to the dimension ($k_1/m \to \rho_1 \in (0, 1)$), we observe an asymptotically sharp phase transition in the probability of correct recovery, similar to that investigated in [24]. Here, for $\delta = 0.25$, $\nu = 0.05$, $k_1 = 0.05m$, we indeed see a sharp phase transition at $\rho = 0.6$.

Fig. 9. Error correction with real face images. We simulate weak proportional growth in the Extended Yale B face database, with the resolution of the images growing in proportion to the number of subjects. Left: fraction of correct recoveries for varying levels of occlusion. Right: examples of correct recovery for each resolution considered. Top: corrupted test image. The fraction of corruption is chosen so that the probability of correct recovery is 50%. Bottom: clean image, from correctly recovered $x_0$.



IV. Discussions and Future Work 

a) Compressed sensing for signals with varying sparsity: In the conventional setting for recovering a sparse signal, one often implicitly assumes that each entry of the signal has an equal probability of being nonzero. As a result, one typically requires that the incoherence (or coherence) of the dictionary is somewhat uniform. In this paper, we saw quite a different example. If we view both $x$ and $e$ as the signal that we want to recover, then the sparsity or density of the combined signal is quite uneven: $x$ is very sparse but $e$ can be very dense. Nevertheless, our result suggests that if the incoherence of the dictionary is adapted to the distribution of the density (more coherent for the sparse part and less for the dense part), then $\ell^1$-minimization will be able to recover such uneven signals even if bounds based on the even sparsity assumption suggest otherwise. Thus, if one has some prior knowledge about which part of the signal is likely to be more sparse or more dense, one can achieve much better performance with $\ell^1$-minimization by using a dictionary with matching incoherence. More generally, for any given distribution of sparsity, one may ask whether there exists an optimal dictionary with matching incoherence such that $\ell^1$-minimization has the highest chance of success.

b) Stability with respect to noise: Although our model does not explicitly consider any noise (say $y = Ax + e + z$, where $z$ is Gaussian noise), $\ell^1$-minimization is known to be stable under small noise [26]. This is also what we have observed empirically in our simulations and in experiments with face images: $\ell^1$-minimization for the cross-and-bouquet model is surprisingly stable to measurement or numerical noise. In fact, as the method is able to deal with dense errors regardless of their magnitude, large noisy entries in $z$ will be treated like errors and be absorbed into $e$. However, a more precise characterization of the effect of noise (say Gaussian) on the estimate of the sparse signal $x$ and the error $e$ remains an open problem.

c) Neighborliness of polytopes: As we have seen in this paper, a precise characterization of the performance of $\ell^1$-minimization requires us to analyze the geometry of polytopes associated with the specific dictionaries in question. In practice, we often use $\ell^1$-minimization for purposes other than signal reconstruction or error correction. For instance, using machine learning techniques, we can learn from exemplars a dictionary that is optimal for certain tasks such as data classification [13]. The polytope associated with such a dictionary may be very different from those that are normally studied in signal processing, coding theory, or error correction, leading to qualitatively different behavior of the $\ell^1$-minimization. Thus, we should expect that in the coming years, many new classes of high-dimensional polytopes with even more interesting properties may arise from other applications and practical problems.

Acknowledgments 

The authors would like to acknowledge helpful conversations with and useful comments from Prof. 
Robert Fossum (UIUC Math), Prof. Olgica Milenkovic (UIUC ECE), Prof. Sean Meyn (UIUC ECE), 
and Dr. Gang Hua (Microsoft Live Labs). This work is partially supported by grants NSF CRS-EHS-0509151, NSF CCF-TF-0514955, ONR YIP N00014-05-1-0633, and NSF HS 07-03756. John Wright is also supported by a Microsoft Fellowship (sponsored by Microsoft Live Labs, Redmond). Finally, Yi Ma
would like to thank Microsoft Research Asia, Beijing, China, for its hospitality during his visit there in 
Summer 2008. 



September 1, 2008 



DRAFT 



MANUSCRIPT SUBMITTED TO IEEE TRANS. ON INFORMATION THEORY, 2008. 



30 



References 

[1] T. Figiel, J. Lindenstrauss, and V. D. Milman, "The dimension of almost spherical sections of convex bodies," Acta Math., vol. 139, no. 1-2, pp. 53-94, 1977.
[2] B. S. Kashin, "The widths of certain finite-dimensional sets and classes of smooth functions," Izv. Akad. Nauk SSSR Ser. Mat., vol. 41, no. 2, pp. 334-351, 1977.
[3] M. Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs 89. American Mathematical Society, 2001.
[4] R. Tibshirani, "Regression shrinkage and selection via the Lasso," Journal of the Royal Statistical Society Series B, vol. 58, pp. 267-288, 1996.
[5] W. Fu and K. Knight, "Asymptotics for Lasso-type estimators," Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.
[6] N. Meinshausen and B. Yu, "Lasso-type recovery of sparse representations for high-dimensional data," to appear in Annals of Statistics, 2006.
[7] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss, "Combining geometry and combinatorics: A unified approach to sparse signal recovery," Preprint, 2008.
[8] V. Guruswami, J. R. Lee, and A. Razborov, "Almost Euclidean subspaces of $\ell_1^N$ via expander codes," Electronic Colloquium on Computational Complexity, Report No. 86, 2007.
[9] A. Bruckstein, D. L. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," To appear in SIAM Review, 2008.
[10] W. Bajwa, J. Haupt, G. Raz, and R. Nowak, "Compressed channel sensing," in Proceedings of Conference on Information Sciences and Systems, 2008.
[11] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
[12] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution as sparse representation of raw image patches," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[13] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[14] E. Candes and T. Tao, "Decoding by linear programming," IEEE Trans. Information Theory, vol. 51, no. 12, 2005.
[15] K. Lee, J. Ho, and D. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684-698, 2005.
[16] D. Donoho, "For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797-829, 2006.
[17] J. Tropp, "Greed is good: algorithmic results for sparse approximation," IEEE Trans. Information Theory, vol. 50, no. 10, pp. 2231-2242, 2004.

[18] D. Needell and J. Tropp, "CoSAMP: Iterative signal recovery from incomplete and inaccurate samples," To appear in 

Applied and Computational Harmonic Analysis, 2008. 
[19] Y. Eldar and M. Mishali, "Robust recovery of signals from a union of subspaces," preprint, 2008. 
[20] M. Sipser and D. A. Spielman, "Expander codes," IEEE Transactions on Information Theory, vol. 42, no. 6, 1996. 
[21] J. Feldman, T. MaMn, R. Servedio, C. Stein, and M. Wainwright, "LP decoding corrects a constant fraction of errors," 

IEEE Trans. Information Theory, vol. 53, no. 1, pp. 82-89, 2007. 






[22] M. R. Capalbo, O. Reingold, S. P. Vadhan, and A. Wigderson, "Randomness conductors and constant-degree lossless expanders," in Proceedings of the 34th ACM Symposium on Theory of Computing, 2002, pp. 659-668.
[23] N. Kashyap, "A decomposition theorem for binary linear codes," IEEE Trans. Information Theory, vol. 54, no. 7, pp. 3035-3058, 2008.
[24] D. Donoho and J. Tanner, "Counting faces of randomly projected polytopes when the projection radically lowers dimension," preprint, http://www.math.utah.edu/~tanner/, 2007.
[25] D. Donoho, "Neighborly polytopes and sparse solution of underdetermined linear equations," preprint, 2005.
[26] D. Donoho, "For most large underdetermined systems of linear equations the minimal $\ell^1$-norm near solution approximates the sparsest solution," preprint, 2004.
[27] S. Dasgupta, D. Hsu, and N. Verma, "A concentration theorem for projections," in Conference on Uncertainty in Artificial Intelligence (UAI), 2006.
[28] E. Candes and J. Romberg, "$\ell^1$-magic: Recovery of sparse signals via convex programming," http://www.acm.caltech.edu/l1magic/, 2005.
[29] D. Needell and R. Vershynin, "Signal recovery from inaccurate and incomplete measurements via regularized orthogonalized matching pursuit," preprint, http://www.math.ucdavis.edu/~dneedell/, 2007.
[30] P. Wedin, "Perturbation bounds in connection with singular value decomposition," BIT Numerical Mathematics, vol. 12, pp. 99-111, 1972.
[31] N. Alon and J. Spencer, The Probabilistic Method. Wiley-Interscience, 2001.
[32] T. Ferguson, A Course in Large Sample Theory. Chapman and Hall, 1996.






Appendix 
Technical Lemmas and Results 

A. Restricted Isometry for Sparse Vectors 

Here, we give a more precise statement of the restricted isometry property of $[V^* \;\; -SU^*]$ used in
the proof of Lemma 3. For an arbitrary matrix $M$, we defined $\gamma_k(M) = \inf_{\|y\|_0 \le k,\, y \ne 0} \|My\|_2 / \|y\|_2$. We
are interested in $\gamma_{cm}([V^* \;\; -SU^*])$, where $U$, $S$, and $V$ come from a (compact) singular value
decomposition of $P = A_2^* \pi_{A_1^\perp}$, after dropping the largest singular value. The constants in the following
result are less important than the fact that for $c$ sufficiently small, $\gamma_{cm} = \Omega(\nu)$.
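
To make the quantity $\gamma_k(M)$ concrete, the following minimal sketch evaluates it by brute force on a small matrix. It assumes the definition $\gamma_k(M) = \inf_{\|y\|_0 \le k,\, y \ne 0} \|My\|_2/\|y\|_2$, which equals the smallest singular value over all submatrices formed from $k$ columns of $M$. The function name `gamma_k` and the random test matrix are illustrative only and do not come from the paper; the enumeration is exponential in $k$, so this is feasible only at toy sizes.

```python
import itertools
import numpy as np

def gamma_k(M, k):
    """Brute-force gamma_k(M): min over column subsets L, |L| = k, of sigma_min(M[:, L])."""
    rows, cols = M.shape
    if k > cols:
        raise ValueError("k cannot exceed the number of columns")
    if k > rows:
        return 0.0  # more than `rows` columns are always linearly dependent
    best = np.inf
    for L in itertools.combinations(range(cols), k):
        # Singular values are returned in decreasing order; the last is the smallest.
        s = np.linalg.svd(M[:, list(L)], compute_uv=False)
        best = min(best, s[-1])
    return float(best)

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 8)) / np.sqrt(6)   # hypothetical small Gaussian matrix
print([round(gamma_k(M, k), 3) for k in (1, 2, 3)])
```

Since every support of size $k$ is contained in one of size $k+1$, the printed values are non-increasing in $k$.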

Lemma 5 (Restricted Isometry): Suppose that $\rho < \delta$, $\nu < 1/9$, and $c$ is sufficiently small:

c < mini — ^ -7—^ I, pH(c/p) + 6H(c/6) < — (41) 

\ 1024 ' 64(1 + 2C^^-i/2)2 j' v/a^;t V / ; ^^Stt^' ^ ' 

where $H(\cdot)$ is the base-$e$ binary entropy function. Let $u_1, v_1$ denote the first singular vectors of
$P = A_2^* \pi_{A_1^\perp} \in \mathbb{R}^{(\delta m - k_1) \times \rho m}$. Then if $USV^*$ is a compact singular value decomposition of $\pi_{u_1^\perp} P \pi_{v_1^\perp}$,

7cm( [V* - SU*] ) > ^ (42) 

lb 

on the complement of a bad event of probability $\le e^{-Cm(1+o(1))}$.

Proof: Notice that the conditional distribution of P given Ai is Gaussian: P = Z^vr^i + l/.i}c7r^-L = 
Zgvr^jL + Ifi*. We argue that the second term dominates: 

a) determines the leading singular vectors: Since the columns of Ai are ki small perturbations 
of fija, the residual ||/i|| = ||vryi-L/ijc|| should be small. However, we will see that it is not too small: 
||7r^-L/ijc|| = ^{k^^^'^). Choose an orthonormal basis for M^™, with first basis vector The 

expression of Ai w.r.t. this basis is [^] + ei(c* + ||/Xjc||l*) = [°] + eii>*, where B and c are 

2 

iid Af{0,u'^/m). So, 



can be written as 

2 



^—Cm 



Applying Factjljto the (pm—l) x ki matrix B, one can easily show that P \\{B*B) 
By ( [2T] ) above, the norm of the A;i -dimensional J\f{0, u^/m) vector c also concentrates: P [\\c\\ > x 
^-C'mki^ On the complement of these bad events, \\v\\ < \\c\\ + ||/xjcl^J| = (1 + ||/2jc < 

With probability one, the matrices $U$ and $V$ are unique up to multiplication of their columns by a common set of signs. The
quantity of interest, $\gamma_k$, does not depend on the choice of signs, so there is no ambiguity in writing $\gamma_k([V^* \;\; -SU^*])$.
and v*{B*B)-^v < ^ki. So, 



1 

1 



> 



1 + 



(43) 



Lemma [5] below shows that with probabiUty > 1 — e_-Cm(i+o{i)) -^^ jj^g random support of the error cq, 
> p/2. Together with ( |43] ), this impUes that = ||/x jc — 7rAi/.tjo II2 > ^\JyT~^^~F- good 
event, Hl^m-fci A*ll2 > Ci m*/^ for some constant Ci and m sufficiently large. From Fact [l] ||^2|| is 
bounded by some constant C2 with probability at least 1 — ^-Cm{i+o{i)) _ Treating Z2'JTj^± as a nuisance 
perturbation of Ifi* and applying Wedin's perturbation bound for principal subspaces [30] then gives 

uiul 



ului 



11* 11* 
1*1 ri 



^ ItlU^ ^ uiul \ 11* 





UiUi 




< 2 






ului 





2 ^9 TT /I -L 

< " ^ " < 
- II1A1I - 



U^Ui U^Ui ) 1*1 
2C2 



Similarly - tt^^H < cT^' ^"""^ 



vr,- 



Now, ||vriiP|| < \\Z2\\ < C2, and ||P7r^,_L|| = ct2(P) < \/2z^(yp + simultaneously with probability 
> 1 - e-C-mCi+oCi)) (the second bound was established in part (a) of the proof of Lemma [sjl. Hence, 
3C3 such that P [ \\tTu^Ptt^± - tt^^ Pit fj^±\\2 > Csm"'"'/^ ] x e"*^™. For an arbitrary matrix W, let 
f{W) = Jcmii'^niW') ~ W*]). We are interested in /(tt^^j, Ptt^^-l ) p| Using the fact that singular values 
of submatrices are 1-Lipschitz and applying Wedin's sin0 theorem [30] to ttpk^w*)' it is not difficult to 
show that if rank(l^ + A) = rank(M^), 

1 



f{W + A)-f{W)\ < 



+ 1 IIAI 



(44) 



^CTminiW) - \\A\ 

where amin{W) is the smallest nonzero singular value. Applying this bound with W = tt^±Pit^±, 
A = TT^±PiT^± — 7ri± Pit and noticing that (Tmin{T^uj-PT^vj;) is bounded below by a positive constant 
with overwhelming probability, we have that | / [t^^i-P-k^^^ — f (vriiPvr^i) | < with probability 
at least 1 — e-C»"(i+o(i))_ We henceforth restrict our attention to /(vriiPvr^i). 

b) Analysis via Gaussian measure concentration: Let S denote the subspace {lZ{Zi) + 7^(/ijc ))"'", 
and let Vq be some orthonormal basis for this subspace, chosen independently of Z2. From the above 
reasoning, we can restrict our attention to vriiPvr^i = 7riiZ|7r£. Let 'k-i±Z2'Ky. = U'S'V* be a compact 
singular value decomposition of this matrix. Then, 



-S'U' 



1cm V* 



I 7r2^27ri-L 



7cm Vq 



I 7r2^27ri-L 



"Since left multiplication by an orthogonal matrix does not change 7cm., /(''r„i Ptt^^ ) = ^cm{\V* — SIJ*\). 
where the final step follows because $\gamma_{cm}$ is invariant under left multiplication of its argument by an
orthogonal matrix. Now, $V_0^* \pi_\Sigma Z_2 = V_0^* Z_2$ is simply distributed as a $(\rho m - k_1 - 1) \times (\delta m - k_1)$ iid
$\mathcal{N}(0, \nu^2/m)$ random matrix. Finally, introduce an additional uniformly distributed random orthogonal
matrix Q G M(P'n-fci-i)x{pm-fci-i)^ chosen independently of Z2, and define ^' = ^^0*^2^2- This is 
again an iid J\f{0, v'^/m) matrix. Notice then, that 7^™ V'* -S'U'* ) = 7cm ( QVq '^tti± )■ 
From the rotational invariance of the Gaussian distribution, it is easy to show that \I' and Q are independent 
random variables. QVq is the transpose of random orthobasis for S; it can be realized by orthogonalizing 
the projection of a Gaussian matrix onto S. To this end, introduce an iid M{0,v'^ /m) matrix <I> G 



o(pm— fci — l)xpm 



independent of S and Then, 7c 



is equal in distribution to 



-1/2 



Let A = (<^7rs<I'*)~^/^, and notice that 



mm amin 

#LiU-L2=cm 



> 



mm mm-! 

#Li=#L2=cm 



max 

#Li=#L2=cm 



where S' denotes the subspace TI{[A^ity:]»,Li)- 

c) Bounding (Train [A<I>7rs].,L-' Applying Fact[l]to <I>7rE gives that -P [ ||<I>7rE||2 > 3z^^] x e ^""/^ 
On the complement of this bad event, cJmin(A) > g^^ . Write 



Straightforward application of Fact 1 shows that P (Tmin{^»,L) < 







while for 



an 
T = 

||T^,.| 



2 

£\ > 0, -P [II vr$.^<^,, LcVTj^^]^, ^ II > 2z^^ + z^^/pelj x e"^^!™/^. Finally, consider the matrix 



Zi vJp 



-p£im/2 

G Mp™x{fci+i)_ interested in ||[7rs^]L,L| 



Tl,,(T*T)-it;_^ 



< 



. It is not difficult to showQthat w.p. > 1 - e-T^^-^+°^^^\ cTminiV > Meanwhile for any 
£2 > 0, P[ ||[Zi]i ,|| > V\fc^ v^fpe2\ s< e~^^^™/^. On the complement of this bad event (and invoking 
Lemma [6]l 



|Ti,.|| < ||[Zi]l,.|| + 



^"Since ^..lc is independent of and E, the norm of tt^, ^ $,,_Lc7r[^j,]^^ is simply distributed as the norm of a cm x cm 
iid 7V(0, i^V"^) matrix. By Factjlj P [ik*.^ -l'.,Lc7r[^^j^^^ || > 2v^c + tv^^ < e-(*-°(i))'=™-/2. Set t = 

^' Write CTmi„(T) > (7™™ Q %|c-^i w't-'j^W ] ) ~ H'^'^^-''^ H - min ^crmin(7r^|^ ^i), z^^p) ~ Ik^jo -^i ll> apply 

Fact 1 to the singular value and standard tail bounds to the fci -dimensional A/'(0, /m) vector ^J^^^Zi. 
By the assumptions of the lemma, ^/c{l + ^) < and ||[7r£_L]L,L|| < ^^^"^ < 4 1/8)^. 

Setting ei = 62 = ^, ||[vrs-L]L,L|| < 1/4, ||vr$.^^>,,L-vr[^^]^,^ || < 2u^/c + u^/8, and so 



\\r. 



(45) 



and amin (|[A$vrs],_j 



> tW — > ^on the complement of a bad event of probability e 



j__ n 

12 12 A/ p 



The number of subsets $L$ of size $cm$ is $e^{\rho m H(c/\rho)(1+o(1))}$. The probability that any $L$ is bad is bounded by
$e^{\rho m [H(c/\rho) - 1/128](1+o(1))}$, which falls off exponentially when $H(c/\rho) < 1/128$. This is guaranteed for
$c/\rho < 1/1024$.
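
The numerical claim in this union bound is easy to check. The sketch below, which uses only the Python standard library, evaluates the base-$e$ binary entropy at $c/\rho = 1/1024$ and compares it with $1/128$; because $H$ is increasing on $(0, 1/2)$, checking the endpoint suffices. It also illustrates the counting estimate $\binom{n}{k} \le e^{nH(k/n)}$ that underlies the subset count. The helper name `entropy_nats` and the values of `n` and `k` are illustrative choices, not taken from the paper.

```python
import math

def entropy_nats(x):
    """Base-e binary entropy H(x) = -x ln x - (1 - x) ln(1 - x)."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

# Endpoint check for the condition H(c/rho) < 1/128.
print(entropy_nats(1 / 1024), 1 / 128)

# Counting bound behind the union bound: ln C(n, k) <= n * H(k / n).
n, k = 2000, 20
print(math.log(math.comb(n, k)), n * entropy_nats(k / n))
```

The value $H(1/1024)$ lies slightly below $1/128$, so the threshold $c/\rho < 1/1024$ leaves only a small margin.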

d) Bounding amin (tts'-l [^'Tr^i], Recall that S' denotes the cm-dimensional range of [A$7rs]»,Li- 
Choose any orthonormal basis for the [{p — c)m — ki — 1] -dimensional subspace S'^. The expression of 
the columns of tt-s>±"^ with respect to this basis is a ((p — c)m — /ci — 1) x (6m — ki) matrix ^ with 
entries AA(0, i^^/m). Split ^7r-i± as 



1 



6m — ki 



1 



6m — k 



Using the independence of 



1 

Sm—ki 



^',.Lcl and and applying Fact 1 it is not difficult to show that 



P 



1 



6m — ki 



< e 



(46) 



For the other term. 



< + o(l)). From Fact 1, > Si/^P 

< ^(1 + o(l)) < ^y^c^ 



< 



e '^2' On the complement of this event. 



eventually. Since 



< 



< 



32' 



1 



Sm—ki 



5m— ki 
< 



1 



-(p-c)m/8(l+o(l)) 



1 



m 



All together, with probability at least 



6m — ki 



7v^p 9 3 

> — o^V C > o^V P 



16 



There are x e^™^^^/^) subsets Li of size cm and x ^^^^{(^1^) subsets L2 of size cm. The total number of 
choices of Li, L2 is asymptotic to e^^^^i^^~^^^^^^^^ , and the probability that any pair is bad is bounded 
by a function asymptotic to exp^( pH[c/p) + 6H{c/6) — m(l + 0(1))^ Under the assumptions of 
the lemma, the exponent is negative. 

Translation does not substantially affect the bound on $\sigma_{\min}$ in Fact 1: for an $m \times n$ iid zero-mean Gaussian matrix $M$ and an
independent translation $x$, $\sigma_{\min}(M + x\mathbf{1}^*) \ge \sigma_{\min}(\pi_{x^\perp} M)$, which obeys the same concentration result, now applied to an
$(m - 1) \times n$ matrix. Appropriate rescaling of the $((\rho - c)m - k_1 - 1) \times cm$ $\mathcal{N}(0, \nu^2/m)$ matrix $\Psi'_{\bullet, L_2}$ yields the desired
expression.
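
The inequality $\sigma_{\min}(M + x\mathbf{1}^*) \ge \sigma_{\min}(\pi_{x^\perp} M)$ holds deterministically for tall matrices, because $\pi_{x^\perp}(M + x\mathbf{1}^*) = \pi_{x^\perp} M$ and an orthogonal projection cannot increase the norm of any vector. A minimal numerical sketch follows; the dimensions, the value of `nu`, and the helper `sigma_min` are assumptions chosen purely for illustration.

```python
import numpy as np

def sigma_min(M):
    """Smallest singular value of a tall matrix (rows >= columns)."""
    return np.linalg.svd(M, compute_uv=False)[-1]

rng = np.random.default_rng(0)
m, n, nu = 60, 20, 0.3
M = (nu / np.sqrt(m)) * rng.standard_normal((m, n))   # iid N(0, nu^2/m) entries
x = rng.standard_normal(m)                            # arbitrary translation
ones = np.ones(n)

# Orthogonal projector onto the complement of span{x}.
P = np.eye(m) - np.outer(x, x) / (x @ x)

lhs = sigma_min(M + np.outer(x, ones))   # translated matrix M + x 1^*
rhs = sigma_min(P @ M)                   # projected matrix pi_{x^perp} M
print(lhs, rhs)
```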
e) Bounding the cross-coherence 



TTs" [^T^i^], L. ■■ Let S" denote the subspace 7^([A$7rs].,Li; 



Notice that S" and 5' are probabilistically independent. Now, 



< |kE"*.,LJ| + 



'-Sm-ki ^cm 



8m 



< |ks"*.,LJ| + 



m 



5m- 



eventually, since 
cm X cm iid A^(0, u'^/m) matrix, and so for any si > 0, 



^ < 1 eventually. Now, ||7rs"^'»,i,2ll is distributed as the norm of a 



Similarly, tts"^! is has the same norm as a cm-dimensional iid A/'(0, u'^/m) vector, so 



P 



1 



=7rs"*l 



\/ 5m — ki 

On the complement of these two bad events, 
S2 = 1/16. Then w.p. > 1 — e~ 



< g-2£ipm/7r=_ 



(47) 



(48) 



r(l+o(l)) 



TTs" [*7ri-L],_^^ < (ei + £2) + 3 1/^. Set e\ = 
TTE" [^'VTi^l.^i^ ^ ^ + < We again union 

bound over Li,L2. The number of such pairs is asymptotic to e^^^^'')^''^^*-')™, and the probabiUty 
of some bad pair is bounded by a function asymptotic to exp {i^H + (5iJ(|) — j^f^^ mj . Under 
the hypotheses of the lemma, the coefficient of this exponent is negative. 

f) Pulling the bounds together: For v < 1/9, < ^ < min^,^ cr^j„([A$7rs]«,Li), and so 

this quantity lower bounds min^^^L, min|(7min([A$7rs],,Li), o-mm (tts'-^ [*7ri-L],,Lj}. So, w.p. > 1 - 



e-^^^^^+^W), 7cm([(*T.*-)-/=*^. ^TT,.]) > = Since 



\lcm{[V' -SC/*]) -7cm([(*7r^**)-'^'*Ts ]) | < 

the desired bound follows. 



16 ' 



Lemma 6: Let $J^c$ be chosen uniformly at random from $\binom{[m]}{\rho m}$, and let $\mu \in \mathbb{R}^m$ with $\|\mu\|_2 = 1$ and
$\|\mu\|_\infty \le C_\mu m^{-1/2}$. Then $\|\mu_{J^c}\|_2^2 \ge \rho/2$ on the complement of a bad event of probability $\le e^{-Cm(1+o(1))}$.
Proof: Form the subset $J^c$ by choosing $\rho m$ indices $j_1 \dots j_{\rho m}$, with $j_i$ chosen uniformly at random
from $[m] \setminus \{j_1, \dots, j_{i-1}\}$. Let $Y_0, Y_1, \dots, Y_{\rho m}$ denote the Doob process associated with $\|\mu_{J^c}\|_2^2$:
$Y_0 = E[\|\mu_{J^c}\|_2^2] = \rho$ and $Y_i = E[\|\mu_{J^c}\|_2^2 \mid j_1 \dots j_i]$. Then, letting $X_k = \sum_{i=1}^{k} \mu_{j_i}^2$, $Y_k = X_k + \frac{\rho m - k}{m - k}(1 - X_k)$,

= and 



in+i-ni 



/)m(Xfc + ) + 1 pmXk + 1 



m, — k — 1 



m — k 



^ pmXk + p^m^nj^^^ + 1 ^ 1 



pm m p^m^ 



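The above is $\le C m^{-1}$ for an appropriate constant $C$. By Azuma's inequality (Theorem 7.2.1 of [31]),

$$P\left[\, |Y_{\rho m} - \rho| \ge t \,\right] \le 2 \exp\left( -\frac{t^2}{2 \rho m (C'/m)^2} \right), \qquad (49)$$

which, for $t = \rho/2$, decays as $\exp(-Cm)$.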
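
The concentration asserted by Lemma 6 can be observed in a small simulation. The sketch below draws random index sets of size $\rho m$ and records the energy $\|\mu_{J^c}\|_2^2$ of a unit vector restricted to each set. The particular vector `mu` (uniform entries, then normalized, so that $\|\mu\|_\infty$ is of order $m^{-1/2}$), the sizes, and the function name `restricted_energy` are assumptions made for illustration; they are not the paper's experiment.

```python
import numpy as np

def restricted_energy(mu, rho, trials, seed=0):
    """Sample ||mu restricted to a uniform random index set of size rho*m||_2^2."""
    rng = np.random.default_rng(seed)
    m = mu.size
    size = int(round(rho * m))
    out = np.empty(trials)
    for t in range(trials):
        idx = rng.choice(m, size=size, replace=False)  # uniform subset, no repeats
        out[t] = np.sum(mu[idx] ** 2)
    return out

m, rho = 4000, 0.3
mu = np.random.default_rng(1).uniform(0.5, 1.5, size=m)
mu /= np.linalg.norm(mu)             # unit l2 norm, entries of order m^{-1/2}
energies = restricted_energy(mu, rho, trials=200)
print(energies.min(), energies.mean(), energies.max())
```

The mean energy equals $\rho$ in expectation, and at this size every sampled value stays well above the $\rho/2$ threshold used in the lemma.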
B. Technical Lemmas for Initial Separating Hyperplane 

This section contains two results used above for controlling the initial separator Qq. We first justify 
the assertion that [^^ (G*G)~^Z}, cr is the only term that contributes 0(m^/^) to ||goll> and then 
close with a measure concentration result for \\9 ■ ||, also used in the proof of Lemma |4] 

Lemma 7 (Lower order terms in Qq): Suppose that p < S and u < g^^_^_-^^ ■ There exist constants (wrt 
m) Cg and Cq such that 

\\{G*G)-^ < Cg and \\q, - [^^ {0*0)"' Zl,a\\ < Cqm'/^-^"/' (50) 

simultaneously on the complement of a bad event of probability < e"*^"^^ ^'o/-(i+o(i)) 



Z^Zi ZIZ2 

ZZZi 2^2'^2"l"I 



Proof: Write Q 
IC* + all*, where a = fXjai^j.. So, 



\ and C = Z*j^ E M". Then G*G = Q + C^* + 



{G*G)-^ = Q-'-Q 



-1 



1-1 



1 c 



a 



1* 



Q 



-1 



(51) 



Set $b = \mathbf{1}^* Q^{-1} \mathbf{1}$, $c = \mathbf{1}^* Q^{-1} \zeta$, $d = \zeta^* Q^{-1} \zeta$, and write $(G^*G)^{-1} = Q^{-1} - Q^{-1/2} M \Xi M^* Q^{-1/2}$ with
M = \ .Tl'li. ] and H 



h(a-d) -V6(i(c+1) 
-\/M(c+1) hd 



(52) 



6 (a - d) + (c + 1)^ ■ 

We next bound the quadratic terms h, c, and d. Applying Fact[T]to the 6m x pm iid AA(0, jra) matrix 
Zjc , = [Zi Z2] gives that ||Zje^,||2 < ^plv [^fb + ^p) w.p. > 1 - e-C™(i+°(i)). On the complement 
of that bad event, 



> 



bra 



> 



5m 



= Cf) m. 



WQW - l + ||Zjc,.||2 - l + 2u^V6 + ^r 
Similarly, b < 5m/amin{Q)- It is not difficult to showp^ that for any block matrix M = f 

[A) < 1, 



(53) 
with 



\AP\\B\ 



By Fact [T| on the complement of an event of probability x e 



Cm 



{A) 



l^if < \\Zif < 2v^p, 



\Z2f < \\Z2f <1v^ ( + 



On the good event above, for v < \, CTminiZi) < \\Zi f < 1. Plugging in, amin{Q) = CTmin {[^0^1]) ^ 



V2 



> 'LLP 
— 4 



for V sufficiently small (e.g., v < 



8(v^+l) 



suffices). 



and so 6 < w.p. > 1 — e 



Cm(l+o(l)) 



^Write (T^i„(M) > minn^^ ||2_|^||^2 (||yla;i ||2 - ||-Ba;2 1|2)^ + ||a;2||2- Setting A = \\xi\\l, the previous is > 



minA6[o,i] crLn(^) + (1 - '^Ln{A)){l - A) - 2P||||B||Vl- A, which is minimized at Vl - A = 
For c = 1*Q~^C, notice that C = Z*j, ,iija is iid 7V(0, z^^a/m). Write Q = Z}, ,7r^|^2'jc^, + [[] 0] + 



hCC = L+iCC,thenQ-i = L~i-L-iC^^+j^C^~^and|rQ-iC| = 1*L~^C 
\1*L~^(^\. An identical argumenj^to the one given above for Q shows that on the complement of an 
event of probability x e"*^*", amin{L) > and so ||L~^1||2 < ^^m}/'^. Since C, is independent of 
L, ( ' / simply an N , v"^ a / m) random variable, and so for any e > 



< 



\rL-^C\>em^'^ 



< P 



\L-'1\\ > 



4VS 1 



m 



/2 



+ P 



\L-^1\ 



X 



> e 



-C.m 



for some constant $C_\epsilon$ (where we have controlled the second part via standard Gaussian tail bounds).
So, with overwhelming probability, $|c| = |\mathbf{1}^* Q^{-1} \zeta| \le \epsilon m^{1/2}$.

The final quadratic term is d = C*Q^C = C*I^^^C a+CL-^C — C*L^^C- The norm of the 6m- 
dimensional AA(0, z^^a/m) vector C concentrates: by ( [2T] |, ||^||2 < \f2v\f(^ with probability at least 
1 — ^-Cm(\+o(\)) _ exploit the fact that although = 0{y^'^), for most vectors L is well- 

conditioned (due to the presence of the identity matrix in [^^ '^^j)- Consider the subspace Yj = {x \ 
a;/ = 0} C M". Since for all a; G S, ||Lic||2 > ||a;||2, H^^-^l^sH < 1, and 



< 



\L 



+ ll^^^lbllClb ||^(LS)^C||2 < 2zy^a(^ + 



lLi;ll2 



4\/2^ 
vp 



The norm ||vr(j^2)^CII of the projection of C, onto an independent fci -dimensional subspace is distributed 
as the norm of a fci -dimensional Mi^^v^ajm) vector: P [ 
appropriate e, with overwhelming probability, d < C*L^^C ^ ^v'^aS. 

The denominator of $\Xi$ in (52) is $b(a - d) + (c + 1)^2 \ge C_b\, a (1 - 4\nu^2 \delta) m$. By Lemma 6, $a = \|\mu_{J^c}\|_2^2 \ge \rho/2$
w.p. $\ge 1 - e^{-Cm(1+o(1))}$, and so the denominator is $\ge C_{\mathrm{denom}}\, m$ with overwhelming probability. Since
each of the terms in the numerator is $\le Cm$ with overwhelming probability, $\|\Xi\| \le C_\Xi$ for an appropriate
constant $C_\Xi$. Since the columns of $M$ have unit norm, $\|M\| \le 2$, and

4 



\\{G*G)-^\\ < \\Q-^\\ + \\Q-^\\\\Mf\\E\\ < 
a constant, establishing the first assertion of the lemma. 



v^p v^p 



Cg. 



^"^Consider instead <j'i^i„ 



The singular values of ti^i_ Z2 are distributed as those of a (pm 



1) X [Sm — ki) iid Af{0, v Irn) matrix. The bounds given by Fact|T|are essentially the same as those for Z2 
For example, if $X$ is $\mathcal{N}(0, \sigma^2)$, $P[|X| \ge \sigma t] \le t^{-1} e^{-t^2/2}$.
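We next extend the above reasoning to bound $(G^*G)^{-1}\mathbf{1}$ and $\mathbf{1}^*(G^*G)^{-1}\mathbf{1}$. Notice that

$$(G^*G)^{-1}\mathbf{1} = \lambda_1 Q^{-1}\mathbf{1} + \lambda_2 Q^{-1}\zeta, \qquad \lambda_1 = \frac{c+1}{b(a-d) + (c+1)^2}, \quad \lambda_2 = \frac{-1}{a - d + (c+1)^2/b}.$$
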

For any e > 0, |Ai| < J^oL^j) < c -m'£'^['i-^4t/^5) ^^'■'^ overwhelming probabiUty. Hence for any e" > 0, 
I All < e"m^^/2 for m sufficiently large, on the complement of a bad event of probability x e"*^™. 
Similarly, IA2I < ^ < jn^^y and so 



p (1-41/25)' 

\\{G*G)-Hh < |Ai|||Q-i||l|| + |A2|||Q-i||C|| < + 



8V26 



= Ci. 



Similarly, r(G*G)-il = ,(,_,)t(c+i)2 < ^(™) = ^^2- 

We need one more bound, for |(/ij,<t)|. Consider the Martingale {Xi)'^Q given by Xq = 0, = 

Yl]=i t^jU)^ij)- interested in Xpm = {fJ.j,cr). Since |Xj — < by Hoeffding's 
inequality [31], 



P[\Xpm\>t] < 2exp 



2V^™ /y2 

and so with probability > 1 - e-C'^'"""^', \{tij,a)\ < mi/2-'?o/4_ 
With these results in hand, recall that 



(54) 



9o 



[l^l^]iG*G)-'Zl,a + (-{G*G)-Hj + {f,j,a){G*G)-'l 



The second term of (55 1, 



+ [^'^'] [-1*{G*G)-Hi + {nj,a)l*{G*G)-^lj + ["^^^ ] 1* {G* G)-^ Z^, a . (55) 
[ f ] (-(G*G)-il, + {^lJ, a) {G*G)-H' 



is bounded above by 



w.p. > 1 - e-C'"'"''°^'{i+°{i)). Similarly, for the third term of (|55l 



For the final term of ( [SS] ), i9 = Z},<t is distributed as an iid AA(0, z/ p) vector, independent of G, and so 



^ ^ \\{G*G)-^1\ 
On the complement of this bad event. 



\fijA*{G*G)-^-d\\ < \\{G*G)-'^1\ 



I iG*G)-H 

\ii(G*G)-iiir 



< C7i mi/2-'?o/4_ 



(56) 



(57) 
Lemma 8 (Concentration for Gaussian tops): Fix $\sigma \le 1$, $\epsilon < 1/2$. Let $x$ be a $d$-dimensional random
vector with entries iid $\mathcal{N}(0, \sigma^2)$, and let $\theta$ be the operator that takes the part of $x$ above $1 - \epsilon$:

$$[\theta x](i) = \begin{cases} \operatorname{sgn}(x(i))\,(|x(i)| - 1 + \epsilon), & |x(i)| > 1 - \epsilon, \\ 0, & \text{else.} \end{cases}$$



Then P 



Ixh > 4e~^ S/"^ 



$\le e^{-C_\sigma d}$, where $C_\sigma$ is a constant (w.r.t. $d$) depending only on $\sigma$.



Proof: Let y G M'' be iid 7\/(0, 1), then ||6'a;||2 is equal in distribution to ||0(Ty||2. Now, E||6'(j^/||2 
d-W.{ex{i)f = !^^t^e-*^/'^''^dt. Integrating by parts ^6 yields 

\J'n/2 V vr 1 - e 



and E[||6'CTy||2] < 2e~^d^/'^. Meanwhile, ^f^J6'o-y(i)|2 = y/d ^^^J-Mlll!. It is not 
difficult to show|2^that E^ ^-^ \s^vW . ^ for some constant C'„ > 0, and so E\\eay\\2 > C'^d^/^. 
Since /(•) = \\9a ■ ||2 is 1-Lipschitz for a < 1, P[\\eay\\2 > 2 E||6lcJ2/||2 ] < exp {-8(E\\eay\\2f /tt'^) 
[3]. Plugging in the upper and lower bounds on E|0cr7/||2 yields the result. ■ 
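
The operator $\theta$ in Lemma 8 is an entrywise soft threshold at level $1 - \epsilon$. The following sketch implements it and prints the normalized norm $\|\theta x\|_2 / \sqrt{d}$ for Gaussian vectors of a few standard deviations, which shows how quickly the "top" shrinks as $\sigma$ decreases. It also checks the 1-Lipschitz property that the proof relies on. The function name `theta`, the dimension, and the chosen values of `sigma` and `eps` are illustrative assumptions rather than quantities from the paper.

```python
import numpy as np

def theta(x, eps):
    """Entrywise soft threshold: keep only the part of |x(i)| above 1 - eps, with sign."""
    return np.sign(x) * np.maximum(np.abs(x) - (1.0 - eps), 0.0)

rng = np.random.default_rng(0)
d, eps = 10_000, 0.25
for sigma in (0.2, 0.4, 0.8):
    x = sigma * rng.standard_normal(d)          # entries iid N(0, sigma^2)
    print(sigma, np.linalg.norm(theta(x, eps)) / np.sqrt(d))

# 1-Lipschitz check on an arbitrary pair of vectors.
u, v = rng.standard_normal(d), rng.standard_normal(d)
lipschitz_gap = np.linalg.norm(u - v) - np.linalg.norm(theta(u, eps) - theta(v, eps))
print(lipschitz_gap >= 0)
```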

C. Details of the Proof of Theorem 1

Proof: Consider the weak proportional growth setting $\mathrm{WPG}_{\delta,\rho,C_0,\eta_0}$ with $\rho < \delta$. We first consider
a fixed, arbitrary sequence of signal supports $I \in \binom{[n]}{k_1}$. By Lemma 2, $(I, J, \sigma)$ is $\ell^1$-recoverable if
$\exists\, c \in (0, 1)$ such that

llgolk + T^ll^golb < = {l - e)y/c{p + 5)m^/^ + o{m^/^), (59) 

where ^ = inf||s|||j<cp ||vr7^(G)s||2/||s||2- Choose c small enough that /3 = (p + 6)c satisfies /3 < 

(since in Lemma jij ||s||o is a fraction of 
m, not p). Further suppose that u < min(|, ^^y^^; (512/(5)"^/^). Then by Lemma jsj ^ < 1 — C^v^, 
with probability 1 - e-C™(i+°(i)). 

Meanwhile, by Lemma|4| with probability at least 1 — e"*"™' "o/^(i+o(i))^ llQolb < aii^m^/^ + o(m^/^) 
and ll^^goll — 0(2i^^^e^e^ . On the intersection of these three good events, the left hand side of ( [59] ) 
becomes 

llQolb + ^j-^llfgoll < aiz^m^/2^a2Z/-^exp^-^jmi/2 + o(mi/2). (60) 
^'And noting that Q(z) < ^=6"^"''^ 

^^Apply the strong law of large numbers to d~'^ ^\0(jy{i)\'^ and Slutsky's theorem (Theorem 6 of [32]) to argue that 
For $\nu$ sufficiently small, this is $\le (1 - \epsilon)\sqrt{c(\rho + \delta)}\, m^{1/2} + o(m^{1/2})$, and hence, for $m$ sufficiently large,
on an event of probability $\ge 1 - \exp(-Cm^{1-\eta_0/2}(1 + o(1)))$, $(I, J, \sigma)$ is $\ell^1$-recoverable. There are
$\binom{n}{k_1} \le \exp(m^{1-\eta_0} \log m)$ subsets $I$, and so the probability that $(I, J, \sigma)$ is not $\ell^1$-recoverable for some
$I$ is bounded by

$$\exp\left(-Cm^{1-\eta_0/2}(1 + o(1))\right) \times \exp\left(m^{1-\eta_0} \log m\right) = \exp\left(-Cm^{1-\eta_0/2}(1 + o(1))\right) = o(1),$$

establishing the theorem. ■
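
For readers who want to experiment with the recovery problem that the theorem addresses, the program $\min \|x\|_1 + \|e\|_1$ subject to $y = Ax + e$ can be posed as a linear program. The sketch below is one possible formulation, assuming SciPy's `linprog` with the HiGHS solver is available. The instance generator `bouquet_instance`, its default sizes, and the choice $\mu = m^{-1/2}\mathbf{1}$ are hypothetical illustrations of a dictionary with columns distributed as $\mathcal{N}(\mu, (\nu^2/m) I)$; here `rho` denotes the fraction of corrupted observations. This is not the authors' experimental code.

```python
import numpy as np
from scipy.optimize import linprog

def l1_decode(A, y):
    """Solve min ||x||_1 + ||e||_1 subject to y = A x + e as a linear program."""
    m, n = A.shape
    # Split x = xp - xn and e = ep - en with all parts non-negative, so the
    # l1 objective becomes the plain sum of the LP variables.
    cost = np.ones(2 * n + 2 * m)
    A_eq = np.hstack([A, -A, np.eye(m), -np.eye(m)])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    x = z[:n] - z[n:2 * n]
    e = z[2 * n:2 * n + m] - z[2 * n + m:]
    return x, e, res.fun

def bouquet_instance(m=120, n=60, k=3, rho=0.2, nu=0.1, seed=0):
    """Hypothetical instance: tightly clustered columns, sparse x0 >= 0, sparse errors e0."""
    rng = np.random.default_rng(seed)
    mu = np.ones(m) / np.sqrt(m)
    A = mu[:, None] + (nu / np.sqrt(m)) * rng.standard_normal((m, n))
    x0 = np.zeros(n)
    x0[rng.choice(n, size=k, replace=False)] = rng.uniform(0.5, 1.5, size=k)
    e0 = np.zeros(m)
    J = rng.choice(m, size=int(rho * m), replace=False)
    e0[J] = rng.choice([-1.0, 1.0], size=J.size) * rng.uniform(1.0, 10.0, size=J.size)
    return A, x0, e0, A @ x0 + e0

A, x0, e0, y = bouquet_instance()
x_hat, e_hat, value = l1_decode(A, y)
print("objective:", value, " max |x_hat - x0|:", np.max(np.abs(x_hat - x0)))
```

Because $(x_0, e_0)$ is feasible, the optimal value can never exceed $\|x_0\|_1 + \|e_0\|_1$; comparing `x_hat` with `x0` while varying `rho` and `nu` gives an empirical view of when recovery succeeds.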